Friedrich-Alexander-Universität Erlangen-Nürnberg
Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Prof. Dr.-Ing. Walter Kellermann
Master's Thesis
Ego-noise reduction for the robot NAO using multichannel nonnegative matrix factorization
by Benjamin Thum
September 2015
Supervisor: Dr. Antoine Deleforge
Declaration
I affirm that I have produced the present thesis without outside help and without the use of sources other than those indicated, and that the thesis has not been submitted in the same or a similar form to any other examination authority and been accepted by it as part of an examination. All passages taken verbatim or in substance from other sources are marked as such.
————————————
Place, date
————————————
Signature
Contents
Abstract

List of Abbreviations

Notation

1 Introduction

2 Literature review of audio source separation
   2.1 Models for audio source separation
      2.1.1 Instantaneous and convolutive mixing
      2.1.2 Amplitude and phase ambiguity
      2.1.3 Mixing by spatial source images
      2.1.4 Spatial covariance matrix
      2.1.5 Characteristics of source separation algorithms
   2.2 Properties that allow the separation of sources
      2.2.1 Spectral diversity
      2.2.2 Spatial diversity
      2.2.3 Non-Gaussianity and non-stationarity of sources
   2.3 Common approaches to audio source separation
   2.4 (Almost) blind source separation
      2.4.1 Independent Component Analysis (ICA)
   2.5 Model based source separation via Gaussian processes
      2.5.1 Model for single channel scenarios
      2.5.2 Model for multichannel scenarios
   2.6 The permutation problem

3 Literature review on matrix factorization
   3.1 Cost functions for matrix factorization
      3.1.1 Generalized α divergence as a Csiszár divergence
      3.1.2 The β divergence as a Bregman divergence
      3.1.3 Generalized α-β divergence
   3.2 Properties of (nonnegative) matrix factorization
      3.2.1 Indeterminacies
      3.2.2 Local optimality
   3.3 Solving the matrix factorization problem
      3.3.1 Nonnegative matrix factorization via multiplicative update rules
      3.3.2 Multichannel EM algorithm for the sum of Gaussian components
   3.4 Constraints and variants of matrix factorization
      3.4.1 Sparsity
      3.4.2 Smoothness and temporal continuity
      3.4.3 Local NMF
      3.4.4 Semi-orthogonal NMF
      3.4.5 Fisher NMF
      3.4.6 Scaling problem with a single constraint

4 Ego-noise suppression for the robot NAO
   4.1 Ego-noise analysis of NAO
      4.1.1 Fan and static noise
      4.1.2 Movement noise
      4.1.3 Speech
   4.2 Reduction of stationary components as preprocessing
      4.2.1 Exploiting the high sparsity in the frequency domain to reduce fan spikes
      4.2.2 Static noise reduction by single channel Wiener filter
   4.3 Reduction of non-stationary ego-noise
      4.3.1 Informed source separation by pretraining of the dictionary
      4.3.2 Approximate additivity assumption
      4.3.3 Euclidean MF with a single dictionary
      4.3.4 Euclidean MF with two dictionaries
      4.3.5 Dictionary training based on desired and undesired data
   4.4 Multiplicative Update (MU) algorithm for multichannel data
   4.5 The challenge of multiple mixtures
      4.5.1 Multichannel Wiener filter (MWF)
      4.5.2 Parametric multichannel Wiener filter (PMWF)
      4.5.3 MVDR beamformer

5 Experimental results
   5.1 Experimental setup
   5.2 Prefiltering
      5.2.1 Fan noise reduction
      5.2.2 Static and residual fan noise reduction
      5.2.3 Prefiltering results
      5.2.4 Evaluation approach for movement reduction
   5.3 Movement noise reduction
      5.3.1 Experimental results for single channel data
      5.3.2 Performance of PMWF for spatial preprocessing
      5.3.3 Experimental results for multichannel data
      5.3.4 Discussion on using NMF with MWF as spatial prefilter
      5.3.5 Performance of MVDR beamformer for spatial preprocessing

6 Conclusion

A Single channel results
   A.1 Walking
      A.1.1 α-β-divergence with speech and noise codebook
      A.1.2 MF with the α-β-divergence and Euclidean predictive by a noise codebook
   A.2 Waving
      A.2.1 α-β-divergence with speech and noise codebook
      A.2.2 MF with the α-β-divergence and Euclidean predictive by a noise codebook

B Multichannel results
   B.1 Walking
      B.1.1 MF with the α-β-divergence, MWF as prefilter and speech and noise codebook
      B.1.2 MF with the α-β-divergence and Euclidean predictive, MWF as prefilter and noise codebook
      B.1.3 Multichannel matrix factorization with the α-β-divergence
   B.2 Waving
      B.2.1 PMWF as prefilter
      B.2.2 MF with the α-β-divergence, MWF as prefilter and speech and noise codebook
      B.2.3 MF with the α-β-divergence, MWF as prefilter and noise codebook
      B.2.4 Multichannel matrix factorization with the α-β-divergence

C Overview on related algorithms
   C.0.5 Minimum Mean Square Error (MMSE) estimation
   C.0.6 Wiener filtering
   C.0.7 Nonnegative Graph Embedding
   C.1 Terminology of non-blind source separation algorithms

D Derivations
   D.1 Parameter estimation by matrix factorization
      D.1.1 Euclidean distance as ML estimation of additive Gaussian noise
      D.1.2 ML estimation of the sum of Poisson distributed magnitude components
      D.1.3 Itakura-Saito divergence as ML estimation of gamma multiplicative noise
   D.2 Deriving the α-divergence as a Csiszár divergence
   D.3 Deriving the β-divergence as a Bregman divergence
   D.4 Derivation of MU rules for the α-β divergence (single channel)
   D.5 Derivation of a multichannel EM algorithm for the Itakura-Saito divergence
   D.6 Computational growth for parallel matrix inversion

References
Abstract
This thesis studies ego-noise reduction for the robot NAO using multichannel nonnegative matrix factorization (NMF). The movement ego-noise that the robot creates by walking on the spot and by waving its arm is investigated. It is shown that a significant reduction of these ego-noises on a single channel is possible using a recently proposed NMF algorithm based on the generalized α-β divergence. The existing approach of multichannel multiplicative update rules, in which the mutual information between the channels is discarded, is then derived for the generalized α-β divergence as cost function. This multichannel NMF approach results in worse performance than the algorithm applied to a single channel.
Furthermore, a novel approach that splits the reduction into a spatial prefilter and a single channel NMF algorithm is investigated. Experimental results show that the large speech artifacts introduced by the spatial filter significantly degrade the subsequent single channel NMF algorithm. This leaves room for improvement.
From a theoretical point of view, it is also pointed out that current approaches in the literature for pretraining the dictionary are not well suited for most signals. A new algorithm based on NMF with the Euclidean cost function is developed, whose design takes the pretraining of the dictionaries with the reference signals into account. The experimental results suggest that ego-noise reduction with NMF could benefit from more suitable training algorithms.
Kurzfassung
Diese Arbeit erforscht die Eigengeräuschreduktion für den Roboter NAO mittels nichtnegativer Matrixfaktorisierung (NMF). Dabei werden die Bewegungsgeräusche untersucht, die der Roboter erzeugt, wenn er sich auf der Stelle bewegt, sowie wenn er seinen rechten Arm bewegt. Es wird gezeigt, dass eine signifikante Verringerung des Eigenrauschens mit einem einzigen Kanal möglich ist. Dabei wird ein vor kurzem veröffentlichter NMF Algorithmus mit der generalisierten α-β Divergenz verwendet. Der existierende Ansatz mehrkanaliger multiplikativer Update-Regeln, bei denen der mittlere Transinformationsgehalt vernachlässigt wird, wird für die generalisierte α-β Divergenz verallgemeinert. Experimentelle Ergebnisse zeigen, dass dieser Algorithmus eine schlechtere Performance erreicht als mit einem einzigen Kanal.
Darüber hinaus wird ein neuer Ansatz untersucht, in dem der Algorithmus in einen räumlichen Filter und einen einkanaligen NMF Algorithmus getrennt wird. Die Experimente zeigen, dass die großen Artefakte des räumlichen Filters den nachfolgenden NMF Algorithmus signifikant beeinträchtigen.
Eine theoretische Untersuchung zeigt, dass die aktuellen Herangehensweisen in der Literatur, bei denen das Wörterbuch vorher trainiert wird, für die meisten Signale nicht geeignet sind. Dabei wird ein neuartiger Algorithmus auf Grundlage von NMF mit der euklidischen Kostenfunktion entwickelt. In dessen Design ist das Training des Wörterbuchs durch Referenzsignale berücksichtigt. Die experimentelle Auswertung legt nahe, dass die Eigenrauschunterdrückung durch NMF von besser geeigneten Trainingsalgorithmen profitieren kann.
List of Abbreviations
BSS blind source separation
DFT discrete Fourier transform
EM expectation maximization
i.i.d. independent and identically distributed
IIR infinite impulse response
MF matrix factorization
ML maximum likelihood
MMSE minimum mean square error
MSE mean square error
MU multiplicative update
MVDR minimum variance distortionless response
MWF multichannel Wiener filter
NMF nonnegative matrix factorization
pdf probability density function
psd power spectral density
PMWF parametric multichannel Wiener filter
RAR residual to artifact ratio
SAR speech/signal to artifact ratio
SDR speech/signal to distortion ratio
SIR speech/signal to interference ratio
STFT short time Fourier transform
Notation
Notation for signals
Y or Y single/multiple microphone mixture samples
X or X single/multiple (hidden) microphone source samples
S or S single/multiple (hidden) actual source samples
A single sample is always represented by a normal capital letter and usually accom-
panied with one or more indices. A vector or matrix of observations is indicated by a
bold capital letter and possibly also used with indices. For the definition of a column
vector the following notation is used
\[ \mathbf{Y} = [Y_1, \ldots, Y_I]^T \triangleq [Y_i]_I \]
Parameters and indices
K number of codewords and therefore size of codebook
F number of frequencies
R number of frames
U number of mixtures/microphones
TR to indicate training data
TE to indicate test data
N or S to indicate noise or speech data
The capital letters describe the size of a variable and the corresponding small letters
are used to index it. For example f is used to index the f -th frequency. Similarly t is
used to index time (samples).
Mathematical expressions
$\overset{c}{=}$ equal up to a constant
$\overset{m}{=}$ equal up to a scalar multiplication
$\triangleq$ is defined as
$\overset{!}{=}$ has to be
$\mathbf{Y}^{.[b]}$ element-wise $b$-th power
$\mathbf{Y}.\mathbf{X}$ element-wise matrix multiplication
Chapter 1
Introduction
The rapid growth of computational power in recent years has led to a fast spread of electronic devices into many different areas of human life - in industry, in our homes, in transportation and even in our pockets. One particular example is the manufacturing of human-like robots, whose potential task is to assist humans in their daily life. One of these humanoid robots is called NAO and is developed by Aldebaran Robotics. In order to become human-like, however, a robot must be capable of performing many different tasks - walking, talking and listening, to name a few. More complex tasks, such as understanding and following a given order, are desired as well.
This thesis investigates one of these tasks, speech enhancement. Algorithms designed to understand speech are usually referred to as automatic speech recognition engines. In recent years, efficient algorithms have been developed for scenarios with low reverberation and background noise and are nowadays deployed in many smartphones around the world. These early algorithms lose much of their performance in more challenging scenarios, such as multiple speakers, high reverberation or strong background noise. For this reason, the development of robust speech recognition algorithms, or of suitable preprocessing algorithms for these more challenging scenarios, is still an ongoing research topic.
In robot audition - a general term used for tasks such as speech recognition or speaker
localization for robots - a quite unique problem arises. The ego-noise, meaning the noise the robot creates itself, has to be taken into account as well. When processing real-world audio signals, the static noise of the microphones and sometimes also the noise of a fan have to be taken into account; both must be considered ego-noise, too. Conventional applications such as smartphones, however, do not create audio noise from mechanical movement.
Hence mechanical movement ego-noise is uncommon and, in addition, particularly challenging for the following two reasons. First, the movement noise is intense; in the case of the robot NAO it is significantly louder than the speech signal. The robot's sound might not appear that loud to a person standing in the same room as the robot, but the distance to the noise source has to be taken into account: for the NAO robot, the microphones are located less than one meter from the noise origin. Second, the movement of the robot is quite versatile. It is very difficult to predict, as its properties such as spectral shape or total energy vary strongly over time.
The ego-noise reduction problem for the robot NAO can, as is usually also done for other robots, be separated into the reduction of non-stationary and stationary noise. It is quite common to tackle these two types with two different algorithms, see for example [Inc11], [LBD+14]. Although this work has to deal with both stationary and non-stationary noises, the main focus is on the reduction of the non-stationary movement noise.
Ego-noise reduction systems (especially for the non-stationary noise) can again be divided into those relying only on the audio data and those including additional information. It is especially of interest to consider motor data, such as position, velocity and acceleration of the robot's joints and other moving parts. This work solely focuses on approaches using the audio data only. An overview of ego-noise reduction exploiting sensor data can for example be found in [Inc11].
The reduction of non-stationary noise from non-stationary speech can be seen as a
special case of audio source separation, where only one source, namely the speech
signal, is of interest. Due to this interpretation, (audio) source separation algorithms
are in theory also applicable for the task of ego-noise reduction. As the investigated
robot has four and thus multiple microphones, audio source separation techniques for
multiple signals are of interest.
There exist a number of different approaches to audio source separation. The focus of this thesis lies on one particular type, in which the non-stationarity is modeled in such a way that the resulting optimization problem can be interpreted as a nonnegative matrix factorization (NMF) problem.
The basic idea of NMF is the factorization (or approximation) of a large nonnegative matrix into the product of two smaller matrices. In the last 15 years, NMF has received considerable attention in the literature and has been successfully applied to neuroscience, music separation and pattern recognition tasks such as face recognition.
The task of this thesis can thus be formulated as follows: it shall be investigated whether multichannel audio source separation techniques that can be solved as a nonnegative matrix factorization problem, such as [OF10], are suitable for the ego-noise reduction of the robot NAO. The results shall then be compared to the recently proposed ego-noise reduction method in [DK14].
The ego-noise reduction is hereby considered in a very specific scenario. It is assumed that suitable training material is given. Suitable training material contains a recording of the ego-noise which is similar to the movement the robot is about to perform. In addition, a training recording of the desired speech signal is presumed, and the spatial properties are assumed to be time independent. Speaker and robot have to be positioned at the same positions in the same room for training and testing.
A related real-world scenario is, for example, the NAO robot waiting patiently in a room until it is talked to by a speaker. The first communication, before the robot starts to
move, can then be seen as training data for the spatial features of the speaker. Training data for the movement noise can be created beforehand and then used for the corresponding movements. The assumption that the robot only performs movements where the spatial properties are roughly unchanged is quite restrictive: the robot cannot do much except for waving its hand or walking on the spot. However, this work focuses on the applicability of NMF for ego-noise reduction in this kind of artificial scenario and not on the tracking of, and adaptation to, the spatial characteristics of the speaker.
The novel ideas in this work are the following.
• In chapter 4.2.1 an algorithm for the reduction of fan spikes is proposed, which exploits the narrow spike bandwidth of a few Hertz.
• In chapter 4.3.2 the idea of the approximate additivity assumption is further developed and evaluated using the example of Euclidean matrix factorization.
• The approach commonly used to pretrain the dictionary is not well suited for most signals, and a better suited algorithm based on NMF with the Euclidean distance is derived in chapter 4.3.5.
• An existing multichannel MU algorithm for the Itakura-Saito divergence is modified to solve the problem with the generalized α-β divergence in chapter 4.4.
• Furthermore, in chapter 4.5 a two-stage approach with a spatial prefilter and a single channel NMF algorithm is discussed.
• In the experimental results chapter, audio source separation by NMF is applied for the first time to the robot ego-noise reduction problem.
The thesis is structured as follows. The literature review of works related to the given
problem of ego-noise reduction is divided into chapter 2 for audio source separation
and chapter 3 for the overview of matrix factorization.
Chapter 4 then describes the overall structure of the noise reduction algorithm and contains the novel contributions of this thesis mentioned above.
In chapter 5 the experimental results with real recordings of the robot are shown for the algorithms selected in the previous chapter.
A conclusion and proposals for future research are given in chapter 6.
Chapter 2
Literature review of audio source
separation
Many of the well-known noise suppression techniques assume stationarity, or at least short-term stationarity that lasts long enough so that it is sufficient to estimate the noise characteristics, for example, in the speech pauses. However, if the noise is only short-time stationary over the same time span as the speech signal or less, these algorithms no longer achieve a good noise suppression performance.
For this scenario, the problem of noise reduction can be seen as a source separation problem of two short-time stationary processes, although strictly speaking the estimate of the noise is not of interest. But as source separation can provide an estimate of the speech signal from the mixture, audio source separation algorithms are appropriate for the presented ego-noise reduction problem.
2.1 Models for audio source separation
In source separation it is common to call the unknown signals sources and the observed signals mixtures. Further on, the number of sources is denoted by $J$ and the number of mixtures by $U$. In the robot's ego-noise reduction scenario, the mixtures are simply given by the microphone signals.
2.1.1 Instantaneous and convolutive mixing
The mixing model describes the relation between the sources and the mixtures. Depending on the environment, the mixing is, as described in the following, modeled either as instantaneous or by convolution.
Instantaneous mixing, as for example used in [PC01], is the simplest model. Each mixture is the weighted sum of the $J$ sources, such that $Y_{u,t} = \sum_j G_{uj} S_{j,t}$. The mixing can be written as a set of linear equations using $\mathbf{Y}_t = [Y_{u,t}]_U$ and $\mathbf{S}_t = [S_{j,t}]_J$:
\[ \mathbf{Y}_t = \mathbf{G}\,\mathbf{S}_t \tag{2.1} \]
where $\mathbf{G} = [G_{uj}]_{U\times J}$ is referred to as the mixing matrix.
Unfortunately, audio mixtures in real rooms cannot be described well by instantaneous mixing. Instead, the recorded mixtures are sums of the reverberated sound sources. Reverberation is well described by a convolution with the mixing filters, in this case the room impulse responses $G_{uj}$. The convolutive mixing model will be denoted by
\[ Y_{u,t} = \sum_j G_{uj} * S_{j,t} \tag{2.2} \]
as there exists no matrix notation as convenient as for the instantaneous mixing. In the above form, the convolutive mixing is very difficult to handle. It is therefore often approximated by a complex valued multiplication in the STFT domain; [VBGB14] calls this approximation the narrowband assumption. The mixing process is then written as
\[ \mathbf{Y}_{fr} = \mathbf{G}_f\,\mathbf{S}_{fr} \tag{2.3} \]
where each frequency is modeled by its own mixing matrix [PC01].
If one can truly identify the $U\times J$ mixing matrix $\mathbf{G}$ and its Moore-Penrose pseudo-inverse exists, the sources can be extracted from the mixtures by
\[ \mathbf{S}_t = \mathbf{G}^{+}\,\mathbf{Y}_t \tag{2.4} \]
Note that this also implies that the sources are properly scaled in case of instantaneous
mixing or (approximately) dereverberated in case of convolutive mixing in the STFT
domain.
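As a purely illustrative sketch of equations 2.1 and 2.4 (not part of the algorithms developed in this thesis), the following Python/numpy example simulates a noiseless instantaneous mixture and recovers the sources with the Moore-Penrose pseudo-inverse; the source distributions, dimensions and mixing matrix are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

J, U, T = 2, 3, 1000          # sources, mixtures, time samples
S = rng.laplace(size=(J, T))  # two non-Gaussian source signals
G = rng.normal(size=(U, J))   # instantaneous mixing matrix (eq. 2.1)

Y = G @ S                     # mixtures Y_t = G S_t

# Source extraction with the Moore-Penrose pseudo-inverse (eq. 2.4).
# This only works in the (over)determined case and with known G.
S_hat = np.linalg.pinv(G) @ Y

print(np.allclose(S, S_hat))  # True: noiseless and exactly determined
```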
Dereverberation itself is, however, a very challenging task and has frequently been addressed in the literature. In recent years, nonnegative matrix factorization based approaches have also been developed, for example [KSRS11] for single channel and [MH14] for multichannel scenarios. Though the mixing model as described above is in theory capable of dereverberation, its practical use is limited due to the mixing model's amplitude and phase ambiguity.
2.1.2 Amplitude and phase ambiguity
If not stated otherwise, all algorithms in this work that are based on the above mixing models lack the ability to determine the amplitude and sign/phase of source signal and mixing filter properly. Assume that observations are created by instantaneous mixing in the time domain as $Y_{j,t} = G_j S_{j,t} = \frac{G_j}{c}(c\,S_{j,t}) = \tilde{G}_j \tilde{S}_{j,t}$. This reformulation is valid for any constant $c$, such that all filter-source pairs $\{\frac{G_j}{c},\, c\,S_{j,t}\}$ are equally valid representations of the observations $Y_{j,t}$. Source and mixing coefficients can therefore only be determined up to a signed scaling factor.
In a similar way this can be done for the convolutive mixing in the time-frequency domain. As these observations are typically complex, the constant can be complex as well. The ambiguity is thus over amplitude and phase.
If this ambiguity is not resolved, a proper scaling/dereverberation is of course impossible. For many algorithms this ambiguity is resolved by adding some constraints to the source or propagation vectors. In [HO00], for example, the source variance can be constrained to unity using $E\{S_{jfr}S_{jfr}^H\} = 1$.
2.1.3 Mixing by spatial source images
To circumvent the amplitude and phase ambiguity, [Car98b] introduced the concept of spatial source images (though the name was introduced later in [VAT+12]) by defining $X_{ujt} = G_{uj} S_{jt}$, $X_{ujt} = G_{uj} * S_{jt}$ and $\mathbf{X}_{jfr} = \mathbf{G}_{jf} S_{jfr}$ for the three above models, such that
\[ \mathbf{Y}_t = \sum_j \mathbf{X}_{jt} \quad\text{or}\quad \mathbf{Y}_{fr} = \sum_j \mathbf{X}_{jfr} \tag{2.5} \]
in the time or STFT domain, respectively. $X_{ujt}$ can be interpreted as the actual contribution of the $j$-th source to the $u$-th mixture. The model is for example used in [VBGB14]. In this thesis it is used to avoid the scaling and phase ambiguity in the single channel case.
Note that in the definitions so far, the mixing models are assumed to be independent of time and therefore rather easy to estimate. In real-world ego-noise reduction scenarios, robot and speakers move around, so the mixing vectors are time dependent. This thesis is, however, dedicated to time-independent mixing vectors, which already proves to be a challenging task.
2.1.4 Spatial covariance matrix
The mixing models of equation 2.2 are furthermore limited to single point sources, as each source is defined by a single mixing filter per mixture. They are therefore not suited for spatially diffuse sources. One way to address this challenge (for Gaussian processes) is via the spatial covariance matrix. In chapter 2.5 it is shown that the spatial covariance matrix of a single source with the convolutive mixing model is
\[ \Sigma_{G,jf} = E\{\mathbf{G}_{jf}\mathbf{G}_{jf}^H\} \tag{2.6} \]
and thus of rank 1, where $\mathbf{G}_{jf} = [G_{ujf}]_U$. In [DVG10] a full rank spatial covariance
matrix is used to model diffuse sources.
2.1.5 Characteristics of source separation algorithms
One important assumption made in (almost) all source separation techniques is the independence of the source signals. Besides this assumption and the mixing model, there exist several more properties that can be used to describe a source separation technique. A specific source separation problem can for example be characterized by the following properties:
• Convolutive or instantaneous mixing
• Overdetermined or underdetermined, depending on the ratio of sources to mixtures
• Stationary or non-stationary (short-time stationary) sources
• Gaussian or non-Gaussian observations
• Temporal correlation / spectral shape
Consequently, there exist many different methods for source separation. Nevertheless, their general concept is similar, as they all exploit at least one of the properties described in the next chapter.
2.2 Properties that allow the separation of sources
[Car01] points out that two or more sources cannot be separated if they are stationary,
Gaussian distributed and white, thus if they are i.i.d. Gaussian. In order to make
separation possible, the sources need to fulfill certain properties. In accordance with
[VBGB14], the two most prominent properties will be called spectral diversity and
spatial diversity.
2.2.1 Spectral diversity
In the above example, the sources can still not be separated by applying a filter, i.e. by spectral shaping of the mixture. The problem is thus not the whiteness of the signals but, more precisely, their identical spectral shape. The property of signals having different spectral shapes will further on be referred to as spectral diversity, which is equivalent to different time correlation [Car01].
Roughly speaking, when the probabilities of a certain frequency component being generated by the sources are different, it is possible to distribute each frequency component onto the sources in some optimum manner. For example, by assuming that each frequency component is created by a single source, the source with the highest probability has to be selected for each component.
In chapter 2.5.1 a model is introduced that allows exploiting the spectral diversity of sources through a low-rank approximation of their magnitude squared STFT matrix.
2.2.2 Spatial diversity
Besides the possibility to separate sources based on different spectral characteristics, another possibility exists if more than one (related) signal is observed. In audio source separation, multiple signals are usually obtained with a microphone array. As source separation can only benefit from a microphone array if the sources are located at different positions in space, this property will further on be referred to as spatial diversity.
Spatial diversity is usually exploited for source separation by
• estimating the mixing matrix (algorithms to do so are described later in this thesis), and
• estimating the source signal by the pseudo inverse of the mixing matrix (see equation 2.4).
2.2.3 Non-Gaussianity and non-stationarity of sources
Non-Gaussianity has been used for blind signal separation since its very beginnings (see for example [Car98a]). Some authors ([Car01]) claim that source separation can also be achieved by exploiting non-stationarity.
However, it has to be noted that in all work reviewed for this thesis, non-Gaussianity or non-stationarity of a source was only used to identify spectral or spatial parameters. Thus the source separation, as for example in chapter 2.4.1, was actually performed based on spectral or spatial parameters. This has also been pointed out by [PC01] in their discussion of non-stationary and non-Gaussian models.
Though it is not exploited in any of the reviewed literature, one can think of source separation based on the sources' (non-Gaussian) probability distribution without spatial or spectral diversity. This is for example given by two random processes with statistical independence over time, which are observed by a single mixture. Based on the probability distribution of the two sources, there still exist situations where nontrivial estimates of the two sources can be obtained. As a simple example, two random processes are added up, one creating either plus or minus 1 for each sample and the other one plus or minus 10. The resulting observations are thus from the set {−11, −9, 9, 11}. Obviously, for any observation the realizations of the two underlying processes can be determined. This work will not further investigate such scenarios; they merely suggest that spatial and spectral diversity are not the only possibilities for source separation.
A final interesting remark is that non-Gaussianity and non-stationarity can be used equivalently in some cases [PC01]. In [Li06] the author points out that by applying a time-dependent linear filter to Gaussian i.i.d. processes, the output can either be seen as a non-stationary Gaussian process or as a stationary non-Gaussian process.
Concluding this chapter, all unsupervised algorithms reviewed for this thesis perform source separation by exploiting the spectral and/or spatial diversity of the sources. The identification of the diversity parameters (e.g. the mixing matrix) is then performed
by exploiting non-stationarity or non-Gaussianity of the sources.
2.3 Common approaches to audio source separation
In this chapter the two most common approaches to source separation are introduced,
blind source separation and model based source separation.
2.4 (Almost) blind source separation
An algorithm will be termed almost blind or blind when it does not explicitly model the sources as processes with a certain probability distribution (see chapter 2.5) and does not use reference signals (see chapter 4.3.1). Still, blind source separation algorithms usually require some assumptions, for example about non-Gaussianity or non-stationarity and the statistical independence of the sources.
2.4.1 Independent Component Analysis (ICA)
The most common type of blind source separation problem is Independent Component Analysis (ICA). Its main concept, as presented in [HO00], is to extract sources from the mixture which are again independent. It thus directly exploits the independence of the sources to identify their spatial diversity.
Recall the instantaneous mixing model from equation 2.1,
\[ \mathbf{Y}_t = \mathbf{G}\,\mathbf{S}_t \]
where $\mathbf{S}_t$ now represents a vector of $J$ statistically independent sources. ICA and related algorithms can also be applied in the STFT domain to account for convolutive mixtures. These methods are referred to as frequency dependent ICA (FDICA). See [VJA+10] and [VBGB14] for more information.
If $U \geq J$, and therefore in the (over)determined case, there exists at least one matrix $\mathbf{V} = \mathbf{G}^{+}$ such that
\[ \mathbf{S}_t = \mathbf{V}\,\mathbf{Y}_t \tag{2.7} \]
Conventional ICA methods can therefore not be used for separation in underdetermined scenarios ($J > U$). Approaches for these scenarios are often termed sparse component analysis (SCA), and we refer to [VJA+10] for further information.
However, there may also exist another matrix $\mathbf{V}_P = \mathbf{P}\mathbf{V}$ such that
\[ \mathbf{S}_{P,t} = \mathbf{P}\,\mathbf{S}_t \tag{2.8} \]
As $\mathbf{S}_{P,t}$ and $\mathbf{S}_t$ are both statistically independent, $\mathbf{P}$ has to be a permutation matrix. The permutation problem will be discussed in chapter 2.6, so at this point it is sufficient to calculate a demixing matrix by restoring statistical independence.
One now needs an algorithm that calculates the matrix $\mathbf{V}$ such that the sources in $\mathbf{S}_t$ have minimum mutual information. As pointed out earlier, source separation is impossible for i.i.d. Gaussian processes. In order to achieve separation, an additional assumption has to be made.
ICA based on non-Gaussianity of the sources results from the problem that, if more than one source is Gaussian, the mixing matrix can only be estimated up to an orthogonal transformation [HO00], [Car98a]. Statistical dependence can then be minimized by means of higher order moments. These types of ICA algorithms can be seen as a generalization of principal component analysis, which ensures statistical independence only up to the second moment [Com94].
Again we want to point out that spatial diversity is exploited to separate the sources, while the higher order moments, and thus non-Gaussianity, are only used to identify the spatial parameters.
For completeness, we point out that non-Gaussianity is not the only working assumption for ICA. In [PC01] the authors exploit the non-stationarity of the sources to estimate the mixing matrix.
At this point we are not interested in specific algorithms for obtaining the independent components, but only in the concept of ICA, in order to compare it with the other approaches shown later in this thesis. For more detail on actual ICA algorithms we refer to the many works cited in this chapter.
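As a purely illustrative aside (none of the algorithms in this thesis are based on it), the following sketch demonstrates the ICA concept with scikit-learn's FastICA on a synthetic instantaneous mixture of two non-Gaussian sources; the recovered components come back in arbitrary order and scale, which is exactly the ambiguity discussed above and in chapter 2.6.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
T = 5000

# Two statistically independent, non-Gaussian sources
S = np.vstack([np.sign(rng.normal(size=T)),   # binary +/-1 process
               rng.laplace(size=T)])          # Laplacian process

G = np.array([[1.0, 0.5],
              [0.3, 1.0]])                    # instantaneous mixing matrix
Y = G @ S                                     # observed mixtures

# Estimate independent components; rows of S_hat are the demixed signals.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y.T).T

# Order and scale of S_hat are arbitrary (permutation/scaling ambiguity).
print(S_hat.shape)  # (2, 5000)
```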
2.5 Model based source separation via Gaussian
processes
In contrast to blind source separation, model based source separation is based on explicit models for the sources. After the parameters of the model have been estimated, parameters and model are used to estimate the actual source signals. Hereby the model has to be chosen according to the actual signal.
Statistically independent Gaussian processes are of particular interest, as their (joint) pdf depends on first and second order moments only. This greatly limits the number of parameters that have to be identified. Furthermore, the observations resulting from these processes are jointly Gaussian, and it is known that the optimum MMSE estimator is then linear (see appendix C.0.5).
It is well known that speech, and audio signals in general, can be considered short-time stationary for about 5 to 20 ms. As the focus here is on audio source separation, it is further on assumed that at least one source is short-time stationary.
Due to the limited stationarity of at least one source, it is very common to apply the algorithm in the STFT domain. Each frame is approximately stationary and estimated separately. Note that this approach implicitly assumes independence between the frames. Furthermore, it is very common to assume independence between the frequency bins. The applied filter can then be interpreted as the STFT approximation of the optimum Wiener filter (more details on this matter are given in appendix C.0.5).
As this is (almost) never completely fulfilled with real data, it has to be noted that, strictly speaking, the Wiener filter loses its optimality in the STFT domain. Nevertheless, as it works reasonably well, it is the most common approach for short-time stationary Gaussian processes.
2.5.1 Model for single channel scenarios
Recall the spatial sound source model of equation 2.5,
\[ Y_{fr} = \sum_j X_{jfr} \]
Due to the statistical independence over frequencies and frames and the jointly Gaussian observations, the estimate of the $j$-th spatial sound source in the STFT domain from a single mixture is given by
\[ \hat{X}_{jfr} = E\{X_{jfr}|Y_{fr}\} = \frac{E\{Y_{fr} X_{jfr}^H\}}{E\{Y_{fr} Y_{fr}^H\}}\, Y_{fr} \tag{2.9} \]
Introducing the variance or psd in the STFT domain, $V_{X,jfr} = E\{X_{jfr}X_{jfr}^H\}$, and using the independence of the sources, equation 2.9 simplifies to
\[ \hat{X}_{jfr} = \frac{V_{X,jfr}}{\sum_j V_{X,jfr}}\, Y_{fr} \tag{2.10} \]
Estimating the PSDs
The optimum source estimation filter is given by the single channel Wiener filter of equation 2.10. However, in order to apply the filter, the variance parameters $V_{X,jfr}$ have to be determined. As already pointed out, speech is only short-time stationary. It is therefore further on assumed that at least one source is short-time stationary.
Separation of a non-stationary source from one or more stationary source(s)
This is the common scenario for the suppression of (almost) stationary noise in speech recordings. The most popular algorithm for this task is the following:
When the speech (and noise) is active, the combined speech-and-noise variance is recursively estimated by
\[ V_{\text{speech+noise},fr} = \alpha\, V_{\text{speech+noise},f(r-1)} + (1-\alpha)\,|Y_{fr}|^2 \tag{2.11} \]
Hereby the recursion parameter $\alpha$ has to be chosen to account for how much the variance of the speech varies between frames. As an extreme case it can be set to zero, in which case the variance is estimated by its instantaneous value.
When the speech is not active, one can set $V_{\text{speech},fr} = 0$ in order to get zero output (or $V_{\text{speech},fr} = \epsilon$ to get some comfort noise). The speech pauses can then be used to re-estimate the psd of very slowly changing noise; if the noise does not change at all, its psd can be determined beforehand.
Finally, the estimated speech is obtained by reformulating equation 2.10 as
\[ \hat{X}_{\text{speech},fr} = \frac{V_{\text{speech+noise},fr} - V_{\text{noise},fr}}{V_{\text{speech+noise},fr}}\, Y_{fr} \tag{2.12} \]
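A minimal sketch of this classical scheme is given below, assuming the STFT of the mixture and a previously estimated noise psd (e.g. from a speech pause) are available; voice activity detection and the STFT itself are omitted, and the function name and signature are illustrative only.

```python
import numpy as np

def wiener_stationary_noise(Y, V_noise, alpha=0.8, eps=1e-12):
    """Single channel Wiener filter against stationary noise.

    Y       : complex STFT of the mixture, shape (F, R)
    V_noise : noise psd estimated beforehand, shape (F,)
    alpha   : recursion parameter of eq. (2.11)
    """
    F, R = Y.shape
    X_speech = np.zeros_like(Y)
    V_mix = np.abs(Y[:, 0]) ** 2            # initialize speech+noise psd

    for r in range(R):
        # recursive estimate of the speech+noise variance (eq. 2.11)
        V_mix = alpha * V_mix + (1.0 - alpha) * np.abs(Y[:, r]) ** 2
        # spectral Wiener gain (eq. 2.12), clipped to stay nonnegative
        gain = np.maximum(V_mix - V_noise, 0.0) / (V_mix + eps)
        X_speech[:, r] = gain * Y[:, r]
    return X_speech
```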
Separation of non-stationary sources
If several sources are only short-time stationary, the approach of the previous section is no longer suitable. The variance parameters of all sources have to be estimated at once. The problem can be described as estimating $JFR$ variance parameters out of $FR$ observations in the single channel case and $UFR$ observations in the multichannel case. This is especially problematic for underdetermined mixtures ($J > U$).
One approach to tackle this problem is to replace the variance with the instantaneous power spectrogram and to reduce the number of parameters that have to be estimated by nonnegative matrix factorization (NMF). Though NMF is often applied with different cost functions and therefore without a model, there exists a model that is suitable to introduce NMF as a model-based source separation technique. The NMF model is introduced in the following chapter.
Sum of short-time stationary Gaussian components (single mixture)
Given a single mixture ($U = 1$), the model of independent Gaussian time-frequency bins can be evolved into the generative model first introduced by [BDBG03]. Each frame-frequency bin of the source $X_{jfr}$ is represented by $K_j$ components $C_{jkfr}$:
\[ X_{jfr} = \sum_k^{K_j} C_{jkfr} \tag{2.13} \]
Furthermore, each component can be described by a time (or frame) dependent gain $A_{jkr}^{1/2}$ and a stationary complex Gaussian distributed process with variance $D_{jfk}$. This can be seen as a model for the non-stationarity via amplitude modulated stationary processes. By shifting the gain of the components into the variance, the model can be equivalently described ([FBD09], [OF10]) as
\[ C_{jkfr} \sim \mathcal{N}(0,\, D_{jfk}A_{jkr}) \tag{2.14} \]
with a frame dependent variance. Here $\mathcal{N}(x|\mu,\Sigma) = \det(\pi\Sigma)^{-1} e^{-(x-\mu)^H \Sigma^{-1}(x-\mu)}$ denotes the proper multivariate complex Gaussian distribution.
As the spatial sound sources are assumed to be mutually and individually independent across frequencies and frames, they are also Gaussian and their means and variances add up, such that
\[ X_{jfr} \sim \sum_k^{K_j} \mathcal{N}(0,\, D_{jfk}A_{jkr}) = \mathcal{N}(0,\, \mathbf{D}_{jf}^T \mathbf{A}_{jr}) \tag{2.15} \]
The mixture is the sum of the independent spatial sound sources, such that
\[ Y_{fr} = \sum_j^J X_{jfr} \sim \mathcal{N}\Big(0,\, \sum_j^J \mathbf{D}_{jf}^T \mathbf{A}_{jr}\Big) \tag{2.16} \]
Joint dictionaries and activation matrices for all sources are defined by
\[ \mathbf{D}_f = [\mathbf{D}_{jf}^T]_J \tag{2.17} \]
\[ \mathbf{A}_r = [\mathbf{A}_{jr}^T]_J \tag{2.18} \]
with the total number of components $K = \sum_j^J K_j$. The mixture in the STFT domain is then distributed as
\[ Y_{fr} \sim \mathcal{N}(0,\, \mathbf{D}_f^T \mathbf{A}_r) \tag{2.19} \]
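The generative model can be made tangible by sampling from it: in the following short sketch, every time-frequency bin of a synthetic single channel observation is drawn as a zero-mean proper complex Gaussian whose variance is the corresponding entry of $\mathbf{D}^T\mathbf{A}$; all dimensions and matrices are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
F, R, K = 257, 100, 8                      # frequencies, frames, components

D = rng.uniform(0.1, 1.0, size=(F, K))     # nonnegative dictionary
A = rng.gamma(shape=1.0, size=(K, R))      # nonnegative activations

V = D @ A                                  # variance of every TF bin (eq. 2.19)

# proper complex Gaussian: real and imaginary parts each carry variance V/2
Y = np.sqrt(V / 2) * (rng.normal(size=(F, R)) + 1j * rng.normal(size=(F, R)))

print(np.mean(np.abs(Y) ** 2) / np.mean(V))   # close to 1
```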
ML estimation of variance parameters
[FBD09] describes the maximum likelihood estimation of $\mathbf{D}$ and $\mathbf{A}$ from $\mathbf{Y}$ via the likelihood
\[ C_{ML}(\mathbf{D},\mathbf{A}) = p(\mathbf{Y}|\mathbf{D},\mathbf{A}) \tag{2.20} \]
Due to the independence assumption, $p(\mathbf{Y}|\mathbf{D},\mathbf{A}) = \prod_{fr} p(Y_{fr}|\mathbf{D}_f,\mathbf{A}_r)$.
The negative log-likelihood function is given by
\[ C_{ML}(\mathbf{D},\mathbf{A}) = -\sum_{fr} \log p(Y_{fr}|\mathbf{D}_f,\mathbf{A}_r) = RF\log(\pi) + \sum_{fr}\left[ \log\big(\mathbf{D}_f^T\mathbf{A}_r\big) + \frac{|Y_{fr}|^2}{\mathbf{D}_f^T\mathbf{A}_r} \right] \tag{2.21} \]
\[ \overset{c}{=} \sum_{fr} d_{IS}\big(|Y_{fr}|^2 \,\big\|\, \mathbf{D}_f^T\mathbf{A}_r\big) \tag{2.22} \]
Note that $|Y_{fr}|^2$ is the instantaneous variance or instantaneous power, and its estimate is given as
\[ V_{Y,fr} = \mathbf{D}_f^T\mathbf{A}_r = \sum_j^J \mathbf{D}_{jf}^T\mathbf{A}_{jr} = \sum_j^J V_{X,jfr} \tag{2.23} \]
such that
\[ C_{ML}(\mathbf{D},\mathbf{A}) = \sum_{fr} d_{IS}\big(|Y_{fr}|^2 \,\big\|\, V_{Y,fr}\big) \tag{2.24} \]
Due to the equivalence of NMF with the Itakura-Saito divergence to ML estimation of the sum of Gaussian components (equation 2.21), it is the most commonly used NMF technique for audio source separation. Yet there exist also other models that result in solving an NMF problem; an overview is given in appendix D.1. An algorithm that solves the NMF problem with the Itakura-Saito divergence is given in chapter 3.3.1.
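As an illustration of how the cost in equation 2.24 is typically minimized, the sketch below applies the standard multiplicative update rules for Itakura-Saito NMF (in the spirit of [FBD09]) to a power spectrogram. It is a generic textbook-style implementation, not the exact algorithm reviewed in chapter 3.3.1.

```python
import numpy as np

def is_nmf(P, K, n_iter=200, eps=1e-12, seed=0):
    """NMF of a power spectrogram P (F x R) under the Itakura-Saito divergence."""
    rng = np.random.default_rng(seed)
    F, R = P.shape
    D = rng.uniform(0.1, 1.0, size=(F, K))
    A = rng.uniform(0.1, 1.0, size=(K, R))
    for _ in range(n_iter):
        V = D @ A + eps
        D *= ((P / V**2) @ A.T) / ((1.0 / V) @ A.T + eps)
        V = D @ A + eps
        A *= (D.T @ (P / V**2)) / (D.T @ (1.0 / V) + eps)
    return D, A

# Wiener-style reconstruction of source j from its components (eq. 2.10):
# X_j = (D_j @ A_j) / (D @ A) * Y, applied element-wise to the complex STFT Y.
```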
It is also important to note that the processes of a source are assumed to be independent, but the variances are dependent; they are even explicitly modeled jointly in the time-frequency domain.
By modeling each source as a sum of $K_j$ frame-wise stationary Gaussian components with frame-varying variance, one estimates $KF + KR$ parameters instead of $JFR$ out of $FR$ samples. In that sense, the NMF model can be interpreted as exploiting the spectral diversity. The variance parameters of source $j$ are sparse in the time-frequency domain, as $\mathbf{D}_j\mathbf{A}_j$ is of rank $K_j$ for $K_j < \min(R,F)$ and therefore not of full rank. The NMF model can also be compared with a stationary Gaussian process with variance $V_{jfr} = V_{jf}$: the stationary Gaussian process is simply represented as a rank-1 STFT variance matrix, with an activation vector of all ones.
If more than one mixture is observed, a similar algorithm can be derived to exploit
both spatial and spectral diversity.
2.5.2 Model for multichannel scenarios
Recall the convolutive mixing model in the STFT domain from equation 2.3,
\[ \mathbf{Y}_{fr} = \mathbf{G}_f\,\mathbf{S}_{fr} \]
The time-frequency bins are not independent between the mixtures, such that the optimum linear estimator is jointly calculated from the $U$ mixture observations $\mathbf{Y}_{fr} = [Y_{ufr}]_U$ as
\[ \hat{S}_{jfr} = E\{S_{jfr}|\mathbf{Y}_{fr}\} = E\{S_{jfr}\mathbf{Y}_{fr}^H\}\, E\{\mathbf{Y}_{fr}\mathbf{Y}_{fr}^H\}^{-1}\, \mathbf{Y}_{fr} = V_{S,jfr}\,\mathbf{G}_{jf}^H\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.25} \]
The covariance matrices of sources and observations are hereby defined as
\[ \Sigma_{X,jfr} = \Sigma_{G,jf}\, V_{S,jfr} \tag{2.26} \]
\[ \Sigma_{Y,fr} = E\{\mathbf{Y}_{fr}\mathbf{Y}_{fr}^H\} = E\{\mathbf{G}_f\mathbf{S}_{fr}(\mathbf{G}_f\mathbf{S}_{fr})^H\} = \sum_j V_{S,jfr}\,\Sigma_{G,jf} \tag{2.27} \]
with the spatial covariance matrix $\Sigma_{G,jf} = E\{\mathbf{G}_{jf}\mathbf{G}_{jf}^H\} = \mathbf{G}_{jf}\mathbf{G}_{jf}^H$ and the source variance $V_{S,jfr} = E\{S_{jfr}S_{jfr}^H\}$.
As an alternative, the spatial sound sources can be estimated as
\[ \hat{\mathbf{X}}_{jfr} = \mathbf{G}_{jf}\,\hat{S}_{jfr} = \Sigma_{G,jf}\,V_{S,jfr}\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} = \Sigma_{X,jfr}\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.28} \]
Equation 2.28 describes the multichannel Wiener filter in the STFT domain and is used
in the source separation context for example in [VBGB14] and [LDDR13]. It denotes
an approximation of the optimum filter for mixtures of Gaussian processes, with the
assumptions made above.
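Written out per time-frequency bin, the multichannel Wiener filter of equation 2.28 reduces to a few lines; the following sketch assumes the spatial covariance matrices and source variances are already known (e.g. from a fitted model) and merely applies the filter.

```python
import numpy as np

def multichannel_wiener(Y_fr, Sigma_G, V_S):
    """Estimate the spatial images of all sources in one TF bin (eq. 2.28).

    Y_fr    : observed mixture vector, shape (U,), complex
    Sigma_G : spatial covariance matrices, shape (J, U, U)
    V_S     : source variances in this bin, shape (J,)
    returns : spatial source images, shape (J, U)
    """
    Sigma_X = V_S[:, None, None] * Sigma_G          # eq. (2.26)
    Sigma_Y = Sigma_X.sum(axis=0)                   # eq. (2.27)
    W = Sigma_X @ np.linalg.inv(Sigma_Y)            # J filters of size U x U
    return np.einsum('juv,v->ju', W, Y_fr)
```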
The optimum estimation filter is already given in equation 2.28, but as in the single
channel case the variance parameters have to be estimated by a suitable model. In
[VJA+10] this is referred to as variance modeling, in contrast to linear models, which are
not discussed in this work. The spatial modeling via the spatial covariance matrix was
already mentioned in chapter 2.1.4. The authors of [VJA+10] furthermore divide the spectral modeling into monophonic and polyphonic spectral models. Gaussian mixture models (GMMs), hidden Markov models (HMMs) and extensions such as Gaussian scaled mixture models (GSMMs) are listed as monophonic spectral models, and we refer to [VJA+10] for an overview of these methods. Our interest lies in the polyphonic spectral models, as their factorization of the power spectrogram again leads to the NMF algorithms described in the last chapter for a single mixture and in the following for multiple mixtures.
Modeling multiple mixtures by the sum of short time stationary Gaussian
components
[OF10] was the first to model multiple mixtures as the sum of short-time stationary Gaussian processes. The model results from the combination of the multichannel mixing model of equation 2.3 with equation 2.13 and an additional additive spatial Gaussian white noise term $\mathbf{B}_{fr}$ for each mixture, with diagonal spatial covariance matrix $\Sigma_{B,fr} = \mathrm{diag}([V_{B,ufr}]_U)$. The multichannel model is then written as
\[ \mathbf{Y}_{fr} = \mathbf{G}_{C,f}\,\mathbf{C}_{fr} + \mathbf{B}_{fr} \tag{2.29} \]
The difference to the mixing model is that, instead of $J$ independent Gaussian sources, the mixture is composed of $K$ independent Gaussian components (where each component belongs to one of the $J$ sources) and an additional Gaussian noise term. The variance of the components is modeled in the NMF-typical way by $V_{C,kfr} = D_{fk}A_{kr}$.
The optimum filter of equation 2.25 can also be formulated for the components as
\[ \hat{C}_{kfr} = V_{C,kfr}\,\mathbf{G}_{j_k f}^H\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.30} \]
\[ \text{with}\quad \Sigma_{Y,fr} = \sum_{k\in\mathcal{K}} V_{C,kfr}\,\Sigma_{G,kf} + \Sigma_{B,fr} \tag{2.31} \]
Note that in [OF10] the mixing vectors of components of the same source are identical, such that $\mathbf{G}_{k_1 f} = \mathbf{G}_{k_2 f}$ if $k_1, k_2 \in \mathcal{K}_j$. Here $j_k$ denotes the source that component $k$ belongs to. The estimate of a source can then simply be obtained as $\hat{S}_{j,fr} = \sum_{k\in\mathcal{K}_j} \hat{C}_{kfr}$.
Again, in order to apply the filter, the spatial parameters $\mathbf{G}$ and the component variances, expressed via $\mathbf{D}$ and $\mathbf{A}$, have to be determined beforehand.
Estimation of parameters by discarding the mutual information
When the mutual information between the mixtures is discarded, the multichannel cost function is simply the sum of equation 2.24 over all mixtures:
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = \sum_{u,f,r} d_{IS}\big(|Y_{ufr}|^2 \,\big\|\, V_{Y,ufr}\big) \tag{2.32} \]
In the single channel case, the estimated variance of the observation was simply given by the sum of the variances of the spatial sound images. In the original mixing model, where the mixing matrices are used, dictionary and activation coefficients form the variance of the components by $V_{S,jfr} = \sum_{k\in\mathcal{K}_j} D_{fk}A_{kr}$. The components' impact on the variance of the $u$-th mixture is then obtained by multiplication with the squared mixing coefficient. As the components are assumed to be statistically independent, their variances add up, such that in the noiseless case, $V_{B,fr} = 0$, the variance of a single mixture's time-frequency bin can be written as
\[ V_{Y,ufr} = \sum_j |G_{ujf}|^2 \sum_{k\in\mathcal{K}_j} D_{fk}A_{kr} \tag{2.33} \]
The algorithm of [OF10] to optimize equation 2.32 with respect to the parameters $\mathbf{G}$, $\mathbf{D}$, $\mathbf{A}$ is reviewed in chapter 4.4.
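For illustration, the variance model of equation 2.33 and the summed Itakura-Saito cost of equation 2.32 can be evaluated as in the following sketch, assuming the component-to-source assignment is encoded in an indicator matrix; the multiplicative updates that actually minimize this cost are derived in chapter 4.4 and are not reproduced here.

```python
import numpy as np

def mc_variance(absG2, D, A, K_of_j):
    """Variance model of eq. (2.33).

    absG2  : |G_{ujf}|^2, shape (U, J, F)
    D, A   : dictionary (F, K) and activations (K, R)
    K_of_j : indicator matrix, shape (J, K), K_of_j[j, k] = 1 if k in K_j
    returns: V_Y with shape (U, F, R)
    """
    # per-source variances V_{S,jfr} = sum_{k in K_j} D_fk A_kr
    V_S = np.einsum('jk,fk,kr->jfr', K_of_j, D, A)
    return np.einsum('ujf,jfr->ufr', absG2, V_S)

def is_cost(P, V, eps=1e-12):
    """Summed Itakura-Saito divergence of eq. (2.32); P = |Y|^2, shape (U, F, R)."""
    ratio = P / (V + eps)
    return np.sum(ratio - np.log(ratio + eps) - 1.0)
```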
Estimation of parameters via the EM approach
The authors also propose a second algorithm, in which the mutual information between the channels is not discarded. They perform maximum likelihood estimation on the complete data $\{\mathbf{Y},\mathbf{C}\}$ via
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \tag{2.34} \]
Note that while the observations $\mathbf{Y}$ are known (observable), the components $\mathbf{C}$ are hidden and can only be estimated (for example by 2.30). As shown later, this model therefore results in an expectation maximization (EM) algorithm.
Equation 2.34 can be reformulated using Bayes' rule as
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = p(\mathbf{Y}|\mathbf{C},\mathbf{G},\mathbf{D},\mathbf{A})\, p(\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \tag{2.35} \]
The negative log-likelihood is then given by
\[ -\log p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) = \sum_{fr} \Big[ -\log p(\mathbf{Y}_{fr}|\mathbf{C}_{fr},\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) - \log p(\mathbf{C}_{fr}|\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) \Big] \tag{2.36} \]
Given the components $\mathbf{C}_{fr}$ and all parameters, the observations are simply distributed with the probability distribution of the additive noise $\mathbf{B}_{fr}$ around the mean vector $\mathbf{G}_{C,f}\mathbf{C}_{fr}$ defined by the components:
\[ -\log p(\mathbf{Y}_{fr}|\mathbf{C}_{fr},\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) = -\log\mathcal{N}(\mathbf{Y}_{fr}|\mathbf{G}_{C,f}\mathbf{C}_{fr},\Sigma_{B,fr}) = \log\pi + \log\det(\Sigma_{B,fr}) + (\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr})^H\Sigma_{B,fr}^{-1}(\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr}) \tag{2.37} \]
As the components are also Gaussian distributed and statistically independent, their negative log-likelihood can be written as
\[ -\log p(\mathbf{C}_{fr}|\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) = \sum_k -\log\mathcal{N}(C_{kfr}|0,\, D_{fk}A_{kr}) = \sum_k \Big[ \log\pi + \log(D_{fk}A_{kr}) + \frac{|C_{kfr}|^2}{D_{fk}A_{kr}} \Big] \tag{2.38} \]
The negative log-likelihood is thus, up to a constant, equal to
\[ -\log p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \overset{c}{=} \sum_{fr} \Big[ \log\det(\Sigma_{B,fr}) + (\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr})^H\Sigma_{B,fr}^{-1}(\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr}) + \sum_k \Big( \log(D_{fk}A_{kr}) + \frac{|C_{kfr}|^2}{D_{fk}A_{kr}} \Big) \Big] \tag{2.39} \]
The algorithm of [OF10] to optimize this cost function via an EM algorithm is reviewed in chapter 3.3.2.
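For reference, the complete-data cost of equation 2.39 (up to its additive constant) can be evaluated as in the following sketch; all arrays are assumed given, for example from the E-step estimates of the components.

```python
import numpy as np

def complete_data_nll(Y, C, G, Sigma_B, D, A, eps=1e-12):
    """Negative complete-data log-likelihood of eq. (2.39), up to a constant.

    Y       : observations, shape (U, F, R), complex
    C       : component signals, shape (K, F, R), complex
    G       : mixing vectors per component, shape (U, K, F), complex
    Sigma_B : diagonal noise covariance entries V_{B,ufr}, shape (U, F, R)
    D, A    : component variance model, shapes (F, K) and (K, R)
    """
    # residual Y_fr - G_{C,f} C_fr for every TF bin
    GC = np.einsum('ukf,kfr->ufr', G, C)
    E = Y - GC
    noise_term = np.sum(np.log(Sigma_B + eps) + np.abs(E) ** 2 / (Sigma_B + eps))
    V_C = np.einsum('fk,kr->kfr', D, A)                 # D_fk A_kr
    comp_term = np.sum(np.log(V_C + eps) + np.abs(C) ** 2 / (V_C + eps))
    return noise_term + comp_term
```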
2.6 The permutation problem
So far, source separation was considered for the case that only the mixtures are observable. Obviously this implies that no labeled reference signals are used. A recording containing only the ego-noise would, for example, be considered a labeled reference for the ego-noise. Algorithms that are based not only on the observed mixtures but also on additional recordings will further on be termed guided or informed. The most common such approach for NMF is discussed in chapter 4.3.1. An algorithm can be guided/informed regardless of its assumptions, for example on statistical properties or on an underlying model such as random processes.
For the spatial diversity in ICA, the permutation problem was already pointed out in [Car98a]. [HO00] gives a formal proof of the ambiguity in the order of the components. Consider the mixing model $\mathbf{Y} = \mathbf{G}\mathbf{S}$ with the assumption that the sources $\mathbf{S}$ are statistically independent, and assume there exists a demixing matrix $\mathbf{W} = \mathbf{G}^{+}$ such that $\mathbf{W}\mathbf{Y} = \mathbf{S}$. However, there also exists a matrix $\tilde{\mathbf{W}} = \mathbf{P}\mathbf{W}$ such that $\tilde{\mathbf{W}}\mathbf{Y} = \mathbf{P}\mathbf{S}$. When $\mathbf{P}$ is a permutation matrix, the demixed signals $\mathbf{P}\mathbf{S}$ are still statistically independent. Hence $\mathbf{W}$ and $\tilde{\mathbf{W}}$ are both optimum solutions regarding statistical independence. This is closely related to the statement of [Car98a] about a theorem of Darmois. By using statistical independence as the optimization criterion, the correct order of the sources cannot be determined [VBGB14].
Spatial diversity can also be interpreted as extracting a number of point sources distributed in space. As stated earlier, the order of the extracted sources is arbitrary. This random ordering prevents several point sources from being combined correctly into a single overall source. With this idea, the transition to the equivalent permutation problem of the spectral diversity is rather straightforward. The spectral equivalent to a point source in space is a certain frequency pattern. Assume that $K$ frequency patterns or components have been successfully extracted. As for the spatial diversity, the order of the components is arbitrary. It is known that the $K$ components form $J$ sources by addition. But, without exploiting additional data, there is no information available as to which components belong to which source.
The formulation of the permutation problem is no longer straightforward if spatial and spectral diversity are exploited simultaneously. The ambiguity of the components is resolved in the multichannel NMF algorithm described in chapter 2.5.2, as all components of one source share the same mixing filters.
A component $k_1$ of source $j_1$ with mixing vector $G_{uj_1f}$, codeword $\mathbf{D}_{k_1}$ and activation vector $\mathbf{A}_{k_1}$ leads to the estimated variance $V_{X,uj_1k_1fr} = |G_{uj_1f}|^2 D_{fk_1}A_{k_1r}$. For a given mixture $u$, the same variance can easily be constructed by a component $k_2$ of a different source $j_2$, such that
\[ V_{X,uj_2k_2fr} = |G_{uj_2f}|^2 D_{fk_2}A_{k_2r} = V_{X,uj_1k_1fr} \tag{2.40} \]
by setting $A_{k_2r} = A_{k_1r}$ and
\[ D_{fk_2} = D_{fk_1}\, \frac{|G_{uj_1f}|^2}{|G_{uj_2f}|^2} \tag{2.41} \]
But equation 2.41 has to be fulfilled for all mixtures, such that component $k_1$ of source $j_1$ leads to the same variance parameters as component $k_2$ of source $j_2$. This is in general not granted, as a component's contribution cannot be equivalently represented by a component of another source.
At this point it is, however, not certain what happens if $J \gg U$. For this evaluation it could be beneficial to interpret the problem as a nonnegative tensor factorization problem, as is done in recent literature as an extension of NMF for multichannel data (for an introduction see for example [CZPA09]).
In the case of informed/guided source separation, the ordering or permutation problem can efficiently be solved by using labeled training data, as discussed in chapter 4.3.1.
Chapter 3
Literature review on matrix
factorization
In chapter 2.5.1 it was shown that nonnegative matrix factorization can be beneficial for audio source separation. In recent years, NMF has received a lot of attention, and not only in the audio source separation community. For this reason this chapter is dedicated to reviewing NMF algorithms that are of potential interest for the ego-noise reduction.
Matrix factorization (MF) is a technique to represent a matrix $\mathbf{X} \in \mathbb{R}^{F\times R}$ by the product of two matrices $\mathbf{D} \in \mathbb{R}^{F\times K}$ and $\mathbf{A} \in \mathbb{R}^{K\times R}$ such that
\[ \mathbf{X} = \mathbf{D}\mathbf{A} \tag{3.1} \]
As reviewed earlier, the separation of audio signals is often performed in the DFT domain. In this case $\mathbf{X}$ represents observations in the time-frequency domain with the number of frequencies $F$ and the number of frames $R$. Further on, $\mathbf{D}$ will be referred to as the dictionary and each of its $K$ columns $\mathbf{D}_k$ as a codeword. A codeword hence represents a set of simultaneously excited frequencies, while the entries in $\mathbf{D}_k$ fix the ratio of their amplitudes. The matrix $\mathbf{A}$ will be referred to as the activation matrix, as it shows the activity of the codewords over the frames and therefore over time.
If all involved matrices are nonnegative ($\mathbf{X} \in \mathbb{R}_{\geq 0}^{F\times R}$, $\mathbf{D} \in \mathbb{R}_{\geq 0}^{F\times K}$ and $\mathbf{A} \in \mathbb{R}_{\geq 0}^{K\times R}$), the problem is referred to as nonnegative matrix factorization (NMF).
Regardless of the constraint on nonnegativity, an exact solution $\mathbf{X} = \mathbf{D}\mathbf{A}$ exists only if $\mathrm{rank}(\mathbf{X}) \leq K$. Due to the additive (white) noise generated by the microphones, the rank of $\mathbf{X}$ is $F$ for audio source separation (as usually $R > F$). For this reason, the problem is sometimes also referred to as (nonnegative) matrix approximation. Dictionary and activation matrix form only an estimate of $\mathbf{X}$:
\[ \hat{\mathbf{X}} = \mathbf{D}\mathbf{A} \tag{3.2} \]
In order to determine dictionary and activation matrix, or to evaluate the performance of some given $\mathbf{A}$ and $\mathbf{D}$, a cost function or distance measure has to be defined. Some popular cost functions for matrix factorization are presented in the following.
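As a first concrete illustration of how such a factorization can be computed, the classical multiplicative update rules for the Euclidean (squared Frobenius) cost are sketched below; this is a generic implementation and not the constrained variant developed later in this thesis.

```python
import numpy as np

def nmf_euclidean(X, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF minimizing ||X - D A||_F^2."""
    rng = np.random.default_rng(seed)
    F, R = X.shape
    D = rng.uniform(0.1, 1.0, size=(F, K))
    A = rng.uniform(0.1, 1.0, size=(K, R))
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        D *= (X @ A.T) / (D @ A @ A.T + eps)
    return D, A

# Example: factorize a random nonnegative matrix
X = np.random.default_rng(1).uniform(size=(64, 200))
D, A = nmf_euclidean(X, K=10)
print(np.linalg.norm(X - D @ A) / np.linalg.norm(X))
```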
3.1 Cost functions for matrix factorization
As cost functions, any function can be of interest that measures the similarity between the estimated samples $\hat X$ and the true samples $X$. One set of functions for such a comparison are norms. A function $D(\hat X - X)$ is called a norm if the following properties hold:
• absolute homogeneity: $D(aX) = |a|\,D(X)$
• triangle inequality: $D(X + Y) \leq D(X) + D(Y)$
• if $D(X) = 0$ then $X$ is the zero vector
A more general set of functions are divergences, which will further on be denoted as $D(X\|\hat X)$. Divergences only have to satisfy the following properties:
• nonnegativity: $D(X\|\hat X) \geq 0$
• $D(X\|\hat X) = 0$ if and only if $X = \hat X$
For example, a divergence does not need to satisfy the triangle inequality and does not need to be symmetric, such that in general $D(X\|\hat X) \neq D(\hat X\|X)$.
In the context of matrix factorization, the divergence is usually applied to each element of the matrices, and the total cost is the sum of the divergences of all elements in the matrix. Further on, cost functions and divergences on matrix data will be referred to with the capital letters $C$ and $D(X\|\hat X)$, and the divergence on a single element by the lower-case function $d(X_{fr}\|\hat X_{fr})$. If not stated otherwise, the following relations hold:
$$C(X,\hat X) = D(X\|\hat X) = \sum_{f,r} d(X_{fr}\|\hat X_{fr}) \qquad (3.3)$$
In appendix D.1 it is pointed out that using the elementwise or separable divergence of the observation matrix can be seen as a statistical independence assumption between the observations.
[CZA06] names three classes of generalized divergences that could be of interest for NMF:
• Csiszár's divergences
• Bregman divergences
• Amari's alpha divergence
In the following two special cases of the first two divergences and a generalization,
which is named generalized α-β divergence, will be reviewed.
3.1.1 Generalized α divergence as a Csiszár divergence
In [CZA06] the focus is on the separable Csiszár φ-divergence
$$D_C\left(X\|\hat X\right) = \sum_{r,f} d_\varphi\left(X_{rf}\|\hat X_{rf}\right) \qquad (3.4)$$
with $d_\varphi(x\|y) = x\,\varphi\!\left(\frac{y}{x}\right)$.
For this work one particular case of the Csiszár divergences is of interest, the generalized α divergence. The derivation is shown in appendix D.2. In this work the divergence is defined in accordance with [CZPA09] and [CCA11] as
$$d_\alpha(x\|y) = \begin{cases} \frac{1}{\alpha(\alpha-1)}\left(x^{\alpha}y^{1-\alpha} - \alpha x + (\alpha-1)y\right) & \alpha\in\mathbb{R}\setminus\{0,1\}\\ x\ln\frac{x}{y} + y - x & \alpha = 1\\ y\ln\frac{y}{x} + x - y & \alpha = 0 \end{cases} \qquad (3.5)$$
3.1.2 The β divergence as a Bregman divergence
Again, the separable Bregman divergence is of interest for matrix factorization, such that
$$C_B\left(X,\hat X\right) = \sum_{i,j} d_\varphi\left(X_{ij}\|\hat X_{ij}\right) \qquad (3.6)$$
with the Bregman divergence $d_\varphi(x\|y) = \varphi(x) - \varphi(y) - \nabla\varphi(y)(x-y)$. $\varphi$ has to be a strictly convex function with a continuous gradient $\nabla\varphi$.
In [DS05] multiplicative update rules for the Bregman divergence are derived. Later
[LLP12] propose a coordinate descent algorithm using Taylor series expansion. At
this point we will only use the Bregman divergence to derive the β-divergence in the
following.
The β-divergence was introduced by [EK01] in their work on robust maximum likeli-
hood estimation. Later [CAZ+06] and [FBD09] used it for matrix factorization.
In this thesis the β-divergence is defined as it is used by [CCA11] where β is shifted
by +1 compared to other authors. The new notation is used for the comparison with
the generalized α-β-divergence in chapter 3.1.3.
In appendix D.3 the β-divergence is derived as a special case of the Bregman divergences by using $\varphi_\beta''(x) = x^{\beta-1}$, $\varphi_\beta'(x) = \frac{x^{\beta}}{\beta}$ and $\varphi_\beta(x) = \frac{x^{\beta+1}}{(\beta+1)\beta}$.
The β-divergence is then given as
$$d_\beta(x\|y) = \begin{cases} \frac{1}{\beta(\beta+1)}\left(x^{\beta+1} + \beta y^{\beta+1} - (\beta+1)xy^{\beta}\right) & \beta\in\mathbb{R}\setminus\{-1,0\}\\ x\ln\frac{x}{y} + y - x & \beta = 0\\ \frac{x}{y} - \ln\frac{x}{y} - 1 & \beta = -1 \end{cases} \qquad (3.7)$$
3.1.3 Generalized α-β divergence
In [CCA11] a new type of generalized divergence is proposed:
$$d_{\alpha,\beta}(x\|y) = \begin{cases}
-\frac{1}{\alpha\beta}\left(x^{\alpha}y^{\beta} - \frac{\alpha}{\alpha+\beta}x^{\alpha+\beta} - \frac{\beta}{\alpha+\beta}y^{\alpha+\beta}\right) & \text{for } \alpha,\beta,\alpha+\beta\neq 0\\
\frac{1}{\alpha^{2}}\left(x^{\alpha}\ln\left(\frac{x^{\alpha}}{y^{\alpha}}\right) - x^{\alpha} + y^{\alpha}\right) & \text{for } \alpha\neq 0,\ \beta = 0\\
\frac{1}{\alpha^{2}}\left(\ln\left(\frac{y^{\alpha}}{x^{\alpha}}\right) + \left(\frac{y^{\alpha}}{x^{\alpha}}\right)^{-1} - 1\right) & \text{for } \alpha = -\beta\neq 0\\
\frac{1}{\beta^{2}}\left(y^{\beta}\ln\left(\frac{y^{\beta}}{x^{\beta}}\right) - y^{\beta} + x^{\beta}\right) & \text{for } \alpha = 0,\ \beta\neq 0\\
\frac{1}{2}\left(\ln(x) - \ln(y)\right)^{2} & \text{for } \alpha,\beta = 0
\end{cases} \qquad (3.8)$$
As shown in table 3.1, the generalized α-β divergence includes the previously defined α divergence and β divergence as special cases. Before investigating some special cases, first some of the generalized α-β divergence's properties are worth noting:
• Duality: $D_{\alpha,\beta}(X\|\hat X) = D_{\beta,\alpha}(\hat X\|X)$
• Inversion: $D_{-\alpha,-\beta}(X\|\hat X) = D_{\alpha,\beta}(X^{.[-1]}\|\hat X^{.[-1]})$
• Scaling: $D_{\omega\alpha,\omega\beta}(X\|\hat X) \overset{m}{=} D_{\alpha,\beta}(X^{.[\omega]}\|\hat X^{.[\omega]})$
• Zoom: as described in the following.
For $\alpha \neq 0$ and with $\tilde\beta = \frac{\beta}{\alpha}$, it holds that
$$D_{\alpha,\beta=\alpha\tilde\beta}\left(X\|\hat X\right) = \sum_{f,r} -\frac{1}{\alpha^{2}\tilde\beta}\left(X_{fr}^{\alpha}\hat X_{fr}^{\alpha\tilde\beta} - \frac{1}{1+\tilde\beta}X_{fr}^{\alpha(1+\tilde\beta)} - \frac{\tilde\beta}{1+\tilde\beta}\hat X_{fr}^{\alpha(1+\tilde\beta)}\right) = \frac{1}{\alpha^{2}}D_{\tilde\beta}\left(X^{.[\alpha]}\|\hat X^{.[\alpha]}\right) \overset{m}{=} D_{\tilde\beta}\left(X^{.[\alpha]}\|\hat X^{.[\alpha]}\right) \qquad (3.9)$$
So the α-β-divergence can be interpreted as a β-divergence with parameter $\tilde\beta = \frac{\beta}{\alpha}$ where an α-zoom is applied to its arguments.
This α-zoom property can hence also be applied to all the special cases of the β-divergence. A possible use of this interpretation is to define the α parameter so as to make the observations for the Itakura-Saito divergence more Gaussian (see chapter 2.5.1).
For the sake of completeness, a similar equivalence can be determined for the α-divergence, if $\alpha + \beta \neq 0$.
$$D_{\alpha,\beta}\left(X\|\hat X\right) \overset{m}{=} D_{\tilde\alpha}\left(X^{.[\alpha+\beta]}\|\hat X^{.[\alpha+\beta]}\right) \quad\text{with } \tilde\alpha = \frac{\alpha}{\alpha+\beta} \qquad (3.10)$$
An overview of the special cases of the generalized α-β divergence is listed in table 3.1.
Table 3.1: List of divergences that are included in the generalized α-β divergence

                                  α      β      model
  α-divergence                    α      1−α
  β-divergence                    1      β
  Squared Euclidean distance      1      1      Gaussian
  Kullback-Leibler divergence     1      0      Poisson
  Itakura-Saito divergence        1      −1     Gaussian or Gamma multiplicative noise
  Hellinger distance              0.5    0.5
  Log-Euclidean distance          0      0
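As a compact reference, the following sketch evaluates the generalized α-β divergence of equation 3.8 elementwise and sums over all entries. The function name, the small eps safeguard and the random test matrices are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def ab_divergence(x, y, alpha, beta, eps=1e-12):
    """Generalized alpha-beta divergence of eq. 3.8, summed over all matrix entries."""
    x = np.maximum(np.asarray(x, float), eps)
    y = np.maximum(np.asarray(y, float), eps)
    if alpha != 0 and beta != 0 and alpha + beta != 0:
        d = -(x**alpha * y**beta
              - alpha / (alpha + beta) * x**(alpha + beta)
              - beta / (alpha + beta) * y**(alpha + beta)) / (alpha * beta)
    elif alpha != 0 and beta == 0:
        d = (x**alpha * np.log(x**alpha / y**alpha) - x**alpha + y**alpha) / alpha**2
    elif alpha == -beta and alpha != 0:
        r = y**alpha / x**alpha            # alpha = -beta != 0 (Itakura-Saito for alpha = 1)
        d = (np.log(r) + 1.0 / r - 1.0) / alpha**2
    elif alpha == 0 and beta != 0:
        d = (y**beta * np.log(y**beta / x**beta) - y**beta + x**beta) / beta**2
    else:                                  # alpha = beta = 0: log-Euclidean distance
        d = 0.5 * (np.log(x) - np.log(y))**2
    return d.sum()

# special cases from table 3.1:
X, Xh = np.random.rand(4, 6), np.random.rand(4, 6)
d_euc = ab_divergence(X, Xh, 1, 1)    # half the squared Euclidean distance per entry
d_kl  = ab_divergence(X, Xh, 1, 0)    # generalized Kullback-Leibler divergence
d_is  = ab_divergence(X, Xh, 1, -1)   # Itakura-Saito divergence
```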
In the following we will explicitly view some divergences that are of particular impor-
tance for NMF, as they can be interpreted as ML estimate of some model. A review
on the models can be found in appendix D.1.
Squared Euclidean distance
For α = 1, β = 1, equation 3.8 results in the squared Euclidean distance or least squares cost, which can be described by the squared Frobenius norm of the estimation error:
$$D_{\alpha=1,\beta=1}\left(X\|\hat X\right) = C_{LS}\left(X,\hat X\right) = \|X - \hat X\|_F^2 = \sum_{i,j}\left(X_{ij} - \hat X_{ij}\right)^2 \qquad (3.11)$$
Furthermore, one can define a generalized squared Euclidean distance with α-zoom as
$$D_{\alpha,\beta=\alpha}\left(X\|\hat X\right) = \sum_{f,r} -\frac{1}{\alpha^{2}}\left(X_{fr}^{\alpha}\hat X_{fr}^{\alpha} - \frac{1}{2}X_{fr}^{2\alpha} - \frac{1}{2}\hat X_{fr}^{2\alpha}\right) \overset{m}{=} \|X^{.[\alpha]} - \hat X^{.[\alpha]}\|_F^2 = C_{LS}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.12)$$
Kullback-Leibler divergence
For α = 1, β = 0:
$$D_{\alpha=1,\beta=0}\left(X\|\hat X\right) = C_{KL}\left(X,\hat X\right) = \sum_{i,j} X_{ij}\ln\frac{X_{ij}}{\hat X_{ij}} - X_{ij} + \hat X_{ij} \qquad (3.13)$$
For $\sum_{i,j}X_{ij} = \sum_{i,j}\hat X_{ij} = 1$ this cost function is the Kullback-Leibler divergence or relative entropy. Again, a generalized Kullback-Leibler divergence with α-zoom results from
$$D_{\alpha,\beta=0}\left(X\|\hat X\right) \overset{m}{=} C_{KL}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.14)$$
Itakura-Saito divergence
Another cost function, used for NMF in [FBD09], is the Itakura-Saito divergence. It is the limit case for α = 1, β = −1:
$$D_{\alpha=1,\beta=-1}\left(X\|\hat X\right) = C_{IS}\left(X,\hat X\right) = \sum_{i,j}\frac{X_{ij}}{\hat X_{ij}} - \ln\frac{X_{ij}}{\hat X_{ij}} - 1 \qquad (3.15)$$
One interesting property of the Itakura-Saito divergence is its scale-invariance:
$$d_{IS}(\gamma x\|\gamma y) = d_{IS}(x\|y) \qquad (3.16)$$
For α = −β, a generalized Itakura-Saito divergence with α-zoom can be derived as
$$D_{\alpha,\beta=-\alpha}\left(X\|\hat X\right) \overset{m}{=} C_{IS}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.17)$$
As an example of divergences that cannot be represented by the generalized α-β divergence, a class of generalized γ-divergences can be derived by applying a nonlinear transformation [CCA11]. In this report, though, the focus regarding divergences is on the versatile generalized α-β divergence.
3.2 Properties of (nonnegative) matrix factorization
In this section some important properties of matrix factorization are described.
3.2.1 Indeterminacies
The factorization of a matrix Y into two matrices D and A introduces new indeterminacies. It is important to distinguish the MF indeterminacies from the indeterminacies introduced with a mixing model (see chapters 2.1.2 and 2.6).
Scaling
In the mixing model there is the amplitude and phase ambiguity between mixing matrix and source signal. In addition to that, a scaling ambiguity arises for the NMF parameters. It can best be explained by formally introducing a $K\times K$ diagonal matrix $\Lambda$ into the estimate as $\hat Y = D_\Lambda\Lambda A_\Lambda$, where $D_\Lambda$ and $A_\Lambda$ are scaled versions of $D$, $A$ such that $\|D_{\Lambda,k}\|_p = \|A_{\Lambda,k}\|_p = 1\ \forall k$. $p$ can be chosen such that the entries of $\Lambda$ correspond to the energy of the components (e.g. $p = 1$ if $Y$ represents the magnitude squared or $p = 2$ for the magnitude of the actual observed microphone signal). By shifting the energy from $\Lambda$ arbitrarily onto $D$ and $A$ for each component $k$, one can create an infinite number of pairs of activation matrix and dictionary which all result in the same cost.
In [OF10] this scaling ambiguity is for example removed by rescaling activation matrix
and dictionary, such that all energy is transferred to the activation coefficients.
Permutation
Besides the scaling ambiguity NMF also has a permutation ambiguity. This spectral
permutation was already discussed in chapter 2.6, after NMF was introduced as ML
estimator for variance parameters.
3.2.2 Local optimality
In chapter 3.3 algorithms are introduced to optimize the MF cost function. At this
point only the optimality of these algorithms is discussed.
It is well known that typical NMF cost functions are convex in A alone and convex in D alone, but non-convex in both parameters jointly, which is how NMF is treated in most of the literature (e.g. in [CCA11]). These algorithms therefore perform local optimization and thus require a suitable initialization. On the positive side, when considering NMF as an ML parameter estimator in source separation, we are not necessarily interested in the optimal solution, but only in a set of parameters that provides sufficiently good separation results.
Nevertheless some research in the area of convex NMF algorithms has also been per-
formed (see e.g. [VGA09] or convex-hull NMF in [TKWB11]).
3.3 Solving the matrix factorization problem
This chapter gives examples for the two most common approaches to solve the MF
problem, expectation maximization (EM) and multiplicative update (MU) rules.
3.3.1 Nonnegative matrix factorization via multiplicative update rules
Multiplicative update rules were first used by [LS00] for the squared Euclidean distance and the Kullback-Leibler divergence.
Multiplicative updates ensure nonnegativity by an elementwise multiplication with a nonnegative matrix in each iteration. As the inverse of a nonnegative matrix is itself nonnegative only if the matrix is a scaled permutation (monomial) matrix, it is also convenient that MU rules usually do not require the calculation of matrix inverses.
Another interesting property is that parameters that are initialized with zero stay zero.
Multiplicative Update (MU) algorithm for the generalized α-β-divergence
In [CCA11] multiplicative update rules for their generalized α-β-divergence are pre-
sented. They can thus be used for all the divergences listed in table 3.1. In appendix
D.4 the derivation is reviewed to highlight its similarity to the multichannel MU rules
derived in chapter 4.4.
For α ≠ 0 the update rules can be formulated for the matrices as
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[\beta-1]}\,.\,X^{.[\alpha]}\right)}{D^T\hat X^{.[\alpha+\beta-1]}}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.18)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[\beta-1]}\,.\,X^{.[\alpha]}\right)A^T}{\hat X^{.[\alpha+\beta-1]}A^T}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.19)$$
For completeness, the update rules for the β-divergence are derived by setting α = 1. In matrix notation this writes as
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[\beta-1]}\,.\,X\right)}{D^T\hat X^{.[\beta]}}\right) \qquad (3.20)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[\beta-1]}\,.\,X\right)A^T}{\hat X^{.[\beta]}A^T}\right) \qquad (3.21)$$
and is equivalent to the updates in [FBD09] when considering that their β-divergence's β is shifted by +1.
Also the α-divergence update rules can be derived by inserting β = 1 − α:
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[-\alpha]}\,.\,X^{.[\alpha]}\right)}{D^T\hat X^{.[0]}}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.22)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[-\alpha]}\,.\,X^{.[\alpha]}\right)A^T}{\hat X^{.[0]}A^T}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.23)$$
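To make the structure of these updates concrete, the following sketch implements the single-channel MU iteration of equations 3.18 and 3.19 for α ≠ 0; the function name, random initialization and eps flooring are assumptions made for illustration, not the exact implementation used later in this thesis.

```python
import numpy as np

def ab_nmf_mu(X, K, alpha, beta, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative updates of eqs. 3.18/3.19 for a nonnegative F x R matrix X."""
    rng = np.random.default_rng(rng)
    F, R = X.shape
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    for _ in range(n_iter):
        Xh = D @ A + eps
        # eq. 3.18: activation update
        A *= ((D.T @ (Xh**(beta - 1) * X**alpha)) /
              (D.T @ Xh**(alpha + beta - 1) + eps))**(1.0 / alpha)
        Xh = D @ A + eps
        # eq. 3.19: dictionary update
        D *= (((Xh**(beta - 1) * X**alpha) @ A.T) /
              (Xh**(alpha + beta - 1) @ A.T + eps))**(1.0 / alpha)
    return D, A
```

Setting alpha = 1 and beta = -1 reproduces the Itakura-Saito NMF used in the multichannel rules below.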
Multiplicative Update (MU) algorithm for multichannel data
[OF10] presented the following MU update rules for the optimization problem derived in chapter 2.5.2, obtained by discarding the mutual information between the channels. Recall the cost function of equation 2.32:
$$C_{ML}(G,D,A) = \sum_{u,f,r} d_{IS}\left(|Y_{ufr}|^2\,\|\,V_{Y,ufr}\right)$$
The corresponding multichannel MU rules are given by
$$A \leftarrow A\,.\,\left(\frac{\sum_u\left(\mathrm{diag}(Q_{u,j_k})D\right)^T\left(\hat X_u^{.[-2]}\,.\,X_u\right)}{\sum_u\left(\mathrm{diag}(Q_{u,j_k})D\right)^T\hat X_u^{.[-1]}}\right) \qquad (3.24)$$
$$D \leftarrow D\,.\,\left(\frac{\sum_u \mathrm{diag}(Q_{u,j_k})\left(\hat X_u^{.[-2]}\,.\,X_u\right)A^T}{\sum_u \mathrm{diag}(Q_{u,j_k})\,\hat X_u^{.[-1]}A^T}\right) \qquad (3.25)$$
$$Q_{u,j} \leftarrow Q_{u,j}\,.\,\left(\frac{\left((D_jA_j)\,.\,\hat X_u^{.[-2]}\,.\,X_u\right)\mathbf{1}_{R\times 1}}{\left((D_jA_j)\,.\,\hat X_u^{.[-1]}\right)\mathbf{1}_{R\times 1}}\right) \qquad (3.26)$$
In chapter 4.4 it is shown that equation 3.24 is a special case of the multichannel MU rules for the α-β-divergence, considering that the Itakura-Saito divergence is obtained for α = 1 and β = −1.
Furthermore, the authors use the following normalization in each iteration to remove the scaling indeterminacy:
• scaling of $D_j$ and $G_j$, such that $\sum_u |G_{S,ujf}|^2 = 1$ and $G_{S,1jf} \in \mathbb{R}_+$
• scaling of $D_j$ and $A_j$, such that $\sum_f D_{fk} = 1$
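A possible realization of equations 3.24–3.26 is sketched below. The array shapes, the einsum-based implementation and the per-source tying of the squared mixing gains Q are our own illustrative choices and do not reproduce the exact implementation of [OF10]; the normalization step above is omitted for brevity.

```python
import numpy as np

def multichannel_is_nmf_mu(X, J, K_per_source, n_iter=200, eps=1e-12, rng=None):
    """Multichannel IS-NMF MU rules (eqs. 3.24-3.26) on channel powers X of shape U x F x R."""
    rng = np.random.default_rng(rng)
    U, F, R = X.shape
    K = J * K_per_source
    src = np.repeat(np.arange(J), K_per_source)          # source index j_k of component k
    M = np.zeros((J, K)); M[src, np.arange(K)] = 1.0     # component-to-source map
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    Q = rng.random((U, J, F)) + eps                      # squared mixing gains per source

    Qk = lambda: np.einsum('ujf,jk->ukf', Q, M)          # gains expanded per component

    def estimate():
        return np.einsum('ukf,fk,kr->ufr', Qk(), D, A) + eps

    for _ in range(n_iter):
        Xh = estimate()
        A *= (np.einsum('ukf,fk,ufr->kr', Qk(), D, Xh**-2 * X) /
              (np.einsum('ukf,fk,ufr->kr', Qk(), D, Xh**-1) + eps))      # eq. 3.24
        Xh = estimate()
        D *= (np.einsum('ukf,ufr,kr->fk', Qk(), Xh**-2 * X, A) /
              (np.einsum('ukf,ufr,kr->fk', Qk(), Xh**-1, A) + eps))      # eq. 3.25
        Xh = estimate()
        S = np.einsum('jk,fk,kr->jfr', M, D, A)           # D_j A_j per source
        Q *= (np.einsum('jfr,ufr->ujf', S, Xh**-2 * X) /
              (np.einsum('jfr,ufr->ujf', S, Xh**-1) + eps))              # eq. 3.26
    return D, A, Q, src
```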
3.3.2 Multichannel EM algorithm for the sum of Gaussian components
The second cost function derived in chapter 2.5 was
$$C_{ML}(G,D,A) = \sum_{f,r}\left[\log\det(\Sigma_{B,f}) + \left(Y_{fr} - G_{C,f}C_{fr}\right)^H\Sigma_{B,f}^{-1}\left(Y_{fr} - G_{C,f}C_{fr}\right) + \sum_k\left(\log(D_{fk}A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk}A_{kr}}\right)\right]$$
In appendix D.5 it is shown that the parameters can be estimated by the following equations:
$$D_{fk} = \frac{1}{R}\sum_r \frac{|C_{kfr}|^2}{A_{kr}} \triangleq \frac{1}{R}\sum_r \frac{U_{kfr}}{A_{kr}} \qquad (3.27)$$
$$A_{kr} = \frac{1}{F}\sum_f \frac{U_{kfr}}{D_{fk}} \qquad (3.28)$$
$$G_{S,f} = R_{YS,f}R_{SS,f}^{-1} \qquad (3.29)$$
$$\Sigma_{B,f} = \mathrm{diag}\left(R_{YY,f} - G_{S,f}R_{YS,f}^H - R_{YS,f}G_{S,f}^H + G_{S,f}R_{SS,f}G_{S,f}^H\right) \qquad (3.30)$$
It can be seen in equations 3.27 to 3.30 that the set $\{U_{kfr}, R_{YY,f}, R_{YS,f}, R_{SS,f}\ \forall f,r,k\}$ is sufficient to calculate all parameters. In accordance with the authors, they will be called natural (sufficient) statistics. However, as the sources and components cannot be observed, these statistics have to be estimated before the parameters can be estimated.
Rewriting the estimate of the components from equation 2.30 as a vector over all components, with $\Sigma_{C,fr} = \mathrm{diag}\left([V_{C,kfr}]_K\right)$ and $V_{C,kfr} = D_{fk}A_{kr}$,
$$\hat C_{fr} = [\hat C_{k,fr}]_K = \Sigma_{C,fr}G_{j_k,f}^H\Sigma_{Y,fr}^{-1}Y_{fr} \qquad (3.31)$$
and for the sources, using $\Sigma_{S,fr} = \mathrm{diag}\left([V_{S,jfr}]_J\right)$ and $V_{S,jfr} = \sum_{k\in K_j}D_{fk}A_{kr}$,
$$\hat S_{fr} = [\hat S_{j,fr}]_J = \Sigma_{S,fr}G_{j,f}^H\Sigma_{Y,fr}^{-1}Y_{fr} \qquad (3.32)$$
[OF10] proposes the following estimates for the natural statistics in their EM algorithm:
$$\hat R_{YY,f} = \frac{1}{R}\sum_r Y_{fr}Y_{fr}^H \qquad (3.33)$$
$$\hat R_{YS,f} = \frac{1}{R}\sum_r Y_{fr}\hat S_{fr}^H \qquad (3.34)$$
$$\hat R_{SS,f} = \frac{1}{R}\sum_r\left(\hat S_{fr}\hat S_{fr}^H + \Sigma_{S,fr} - \Sigma_{S,fr}G_{j,f}^H\Sigma_{Y,fr}^{-1}G_{j,f}\Sigma_{S,fr}\right) \qquad (3.35)$$
$$\hat u_{kfr} = \left(\hat C_{fr}\hat C_{fr}^H + \Sigma_{C,fr} - \Sigma_{C,fr}G_{j_k,f}^H\Sigma_{Y,fr}^{-1}G_{j_k,f}\Sigma_{C,fr}\right)_{k,k} \qquad (3.36)$$
The EM algorithm then executes the following steps for several iterations:
• E-step: updating the natural statistics $\hat R_{YS,f}$, $\hat R_{SS,f}$, $\hat u_{kfr}$
• M-step: updating the parameters $D_{fk}$, $A_{kr}$, $G_{S,f}$
• Normalization to remove scale and phase ambiguities:
  – rescaling $D_j$ and $G_j$, such that $\sum_u |G_{S,ujf}|^2 = 1$ and $G_{S,1jf} \in \mathbb{R}_+$
  – rescaling $D_j$ and $A_j$, such that $\sum_f D_{fk} = 1$
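The following sketch outlines one possible implementation of this EM loop under the stated model. The variable names, random initialization and the omission of the normalization step are simplifying assumptions, and the nested loops over frequencies and frames favor readability over speed.

```python
import numpy as np

def multichannel_em_nmf(Y, J, K_per_source, n_iter=50, eps=1e-10, rng=None):
    """EM iterations of chapter 3.3.2 (after [OF10]) on complex STFT data Y of shape U x F x R."""
    rng = np.random.default_rng(rng)
    U, F, R = Y.shape
    K = J * K_per_source
    src = np.repeat(np.arange(J), K_per_source)          # source index of each component
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    G = rng.standard_normal((F, U, J)) + 1j * rng.standard_normal((F, U, J))
    Sigma_B = np.ones((F, U))                            # diagonal noise covariance

    for _ in range(n_iter):
        u_stat = np.zeros((K, F, R))
        for f in range(F):
            Gf = G[f]                                    # U x J mixing matrix
            Gc = Gf[:, src]                              # components share their source's gains
            Ryy = np.zeros((U, U), complex)
            Rys = np.zeros((U, J), complex)
            Rss = np.zeros((J, J), complex)
            for r in range(R):
                vC = D[f] * A[:, r] + eps                # component variances
                vS = np.array([vC[src == j].sum() for j in range(J)])
                Sig_Y = (Gf * vS) @ Gf.conj().T + np.diag(Sigma_B[f])
                iSig_Y = np.linalg.inv(Sig_Y)
                # E-step: Wiener estimates (eqs. 3.31/3.32)
                S_hat = vS * (Gf.conj().T @ iSig_Y @ Y[:, f, r])
                C_hat = vC * (Gc.conj().T @ iSig_Y @ Y[:, f, r])
                # natural statistics (eqs. 3.33-3.36)
                Ryy += np.outer(Y[:, f, r], Y[:, f, r].conj()) / R
                Rys += np.outer(Y[:, f, r], S_hat.conj()) / R
                Rss += (np.outer(S_hat, S_hat.conj()) + np.diag(vS)
                        - (vS[:, None] * (Gf.conj().T @ iSig_Y @ Gf)) * vS[None, :]) / R
                post_C = np.diag(vC) - (vC[:, None] * (Gc.conj().T @ iSig_Y @ Gc)) * vC[None, :]
                u_stat[:, f, r] = np.abs(C_hat) ** 2 + np.real(np.diag(post_C))
            # M-step for the spatial parameters (eqs. 3.29/3.30)
            G[f] = Rys @ np.linalg.inv(Rss)
            Sigma_B[f] = np.real(np.diag(Ryy - G[f] @ Rys.conj().T
                                         - Rys @ G[f].conj().T
                                         + G[f] @ Rss @ G[f].conj().T)) + eps
        # M-step for the NMF parameters (eqs. 3.27/3.28)
        D = (u_stat / A[:, None, :]).mean(axis=2).T
        A = (u_stat / D.T[:, :, None]).mean(axis=1)
    return D, A, G, Sigma_B
```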
3.4 Constraints and variants of matrix factorization
Very often matrix factorization is not used in its plain form, as defined at the beginning
of chapter 3. In order to improve the estimation results, to incorporate labeled data
or to take priors into account, constraints are added to the MF cost function in many
publications.
The constraints are usually added to the NMF optimization problem as a penalty term with a Lagrange multiplier λ as
$$\operatorname*{argmin}_{\theta}\ C_{NMF}(\theta) + \lambda f(\theta) \qquad (3.37)$$
The NMF optimization can then either be seen as an equality constraint $f(\theta) = c$ or as an additional minimization task $\min f(\theta)$, where λ controls the trade-off between the cost functions.
Due to the additive implementation, the constraints can usually be rather straightfor-
wardly integrated into gradient based algorithms of the original NMF problem, such
as the MU rules.
The number of proposed constraints has grown so much in recent years that it is impossible to review all of them. There exist, for example, adaptations of NMF to the source-filter model of speech production [OVB12], Bayesian NMF approaches [D+09] [VG+08], or NMF models incorporating priors [OVB12][VJA+10]. At this point the focus is mainly on two types of constraints: constraints that resolve ambiguities in the most common NMF training scenario (see chapter 4.3.4) and constraints that directly incorporate labeled training data. In the following, the most promising algorithms and ideas for the investigated scenario of ego-noise reduction are reviewed.
3.4.1 Sparsity
Sparsity of the activation coefficients or of the dictionary can be used on audio data to enforce a harmonic structure in the dictionary or to account for components that may be active only in a few time frames (for example a drum in music separation).
A straightforward way to force zero elements into the parameters is to penalize the number of nonzero elements. The number of nonzero elements is equivalent to the L0-norm and unfortunately not well suited for optimization. It is hence frequently approximated with the L1 norm. A sparser dictionary can for example be achieved by adding the constraint
$$\min \|D_k\|_1 = \min\sum_f |D_{fk}|\quad\forall k \qquad (3.38)$$
Note that due to the nonnegativity constraint, $|D_{fk}| = D_{fk}$.
3.4. CONSTRAINTS AND VARIANTS OF MATRIX FACTORIZATION 41
[LN13] derives MU rules for the Euclidean NMF including L1 norm sparsity constraints. More information on sparsity in the NMF literature can be found in [Vir07]. The authors also report that the sparsity constraint did not lead to significant improvements in their single channel audio source separation experiments.
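As an illustration, the sketch below adds an L1 penalty on the dictionary to the Euclidean MU updates; the penalty simply enters the denominator of the dictionary update, which is the common heuristic form and not necessarily the exact update derived in [LN13].

```python
import numpy as np

def sparse_euclidean_nmf(X, K, lam=0.1, n_iter=200, eps=1e-12, rng=None):
    """Euclidean NMF with an L1 sparsity penalty on the dictionary,
    i.e. argmin ||X - DA||_F^2 + lam * sum_k ||D_k||_1 (cf. eqs. 3.37/3.38)."""
    rng = np.random.default_rng(rng)
    F, R = X.shape
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        D *= (X @ A.T) / (D @ (A @ A.T) + lam + eps)   # lam penalizes nonzero entries of D
    return D, A
```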
3.4.2 Smoothness and temporal continuity
In [LN13] the authors enforce smoothness via an L2 norm on the dictionary and activation matrix,
$$\min \|D_k\|_2^2\quad\forall k \qquad (3.39)$$
$$\text{and/or}\quad \min \|A_k\|_2^2\quad\forall k \qquad (3.40)$$
and derive MU rules for the Euclidean NMF.
In [Vir07], the following constraint is used to enforce continuity of the activation vectors
$$\min\sum_k \frac{\sum_{r=2}^{R}\left(A_{kr} - A_{k(r-1)}\right)^2}{\frac{1}{R}\sum_r A_{kr}^2} \qquad (3.41)$$
and it is solved via MU rules. A similar temporal continuity constraint is used in [EF13].
3.4.3 Local NMF
A lot of NMF constraints have been developed for image processing tasks such as face recognition. One of these is the local nonnegative matrix factorization in [LHZC01].
The authors add the following four constraints to the Kullback-Leibler divergence:
• L1 sparsity constraint on the dictionary: subject to $\sum_f D_{fk} = 1\ \forall k$
• L2 smoothness constraint on the dictionary: $\min\sum_f D_{fk}^2\ \forall k$
• penalizing non-orthogonal codewords: $\min\sum_f D_{fk_1}D_{fk_2}\ \forall k_1\neq k_2$
• maximizing the activity of the components: $\min -\sum_r A_{kr}^2\ \forall k$
3.4.4 Semi-orthogonal NMF
In [CZPA09] the authors propose to apply a constraint on the dictionary or activation matrix by applying one of the following updates in each iteration:
$$D = D\left(D^TD\right)^{-\frac{1}{2}} \qquad (3.42)$$
$$A = \left(AA^T\right)^{-\frac{1}{2}}A \qquad (3.43)$$
3.4.5 Fischer NMF
In [WJ04] the Fischer constraint from the Fischer linear discriminant analysis is added to the NMF formulation:
$$\min\ S_W - S_B \qquad (3.44)$$
with the within-class scatter $S_W = \frac{1}{J}\sum_j\frac{1}{K_j}\sum_{k\in K_j}\|A_k - \bar A_j\|_2^2$ and the between-class scatter $S_B = \frac{1}{J(J-1)}\sum_{j_1}\sum_{j_2}\|\bar A_{j_1} - \bar A_{j_2}\|_2^2$. $J$ denotes the number of classes or, in our case, sources, $K_j$ is the codebook size of the $j$-th source, and the mean over a class is given by $\bar A_j = \frac{1}{K_j}\sum_{k\in K_j}A_k$.
However, there is a severe difference between face recognition and source separation. For face recognition only one face is used as input, while in source separation any number of sources can be active at any time. In other words, Fischer NMF tries to find a codebook where the probability of an observed activation belonging to a certain class is maximal. In source separation the goal is to find codewords where the activation of one component is maximal and the activation of all other components is minimal. A suitable measure could for example be to find the dictionary whose corresponding activation matrices have a maximum ratio of desired to undesired representation.
Nevertheless, Fischer NMF is worth noting as it was one of the first NMF algorithms that explicitly used the class information in the training phase. In appendix C.0.7, a generalization of the Fischer NMF, the nonnegative graph embedding, is reviewed.
3.4.6 Scaling problem with a single constraint
It is worth noting that if the scaling ambiguity of NMF is not considered in the design (e.g. by a suitable extra constraint), the scaling property allows the energy to be shifted to the unconstrained parameter(s). In the multichannel case, the scaling ambiguity between source and mixing coefficients has to be taken into account as well.
For example, if the dictionary is penalized by an L1 or L2 norm, the algorithm can simply scale the energy into the activation matrix. This may also be the reason why both an L1 and an L2 norm constraint are applied in the local NMF algorithm (see chapter 3.4.3).
There are also constraints, such as the temporal continuity constraint in chapter 3.4.2, that are invariant to scaling due to their normalization.
Chapter 4
Ego-noise suppression for the robot
NAO
In the two previous chapters, algorithms and ideas from audio source separation and NMF were reviewed. This chapter intends to connect this information with the scenario of robot ego-noise reduction. It is separated into two parts. First, the noises the robot has to suppress are identified and analyzed. Second, a general structure and suitable algorithms are selected and adapted to tackle the ego-noise reduction.
4.1 Ego-noise analysis of NAO
Before suitable suppression techniques can be designed, the different ego-noise types
of NAO have to be identified and analyzed. In the introduction it was already pointed
out that a robot typically suffers from non-stationary and stationary noises. A picture
of NAO is shown in figure 4.1.
The recordings used in this thesis were made with 4 microphones, which are all located
in the head of the robot. As can be seen in figure 4.2, two are located at the front (positions A and B) and two in the back of the head (positions C and D). The microphones create their own noise with their sensors and amplifiers.
1 https://jv.wikipedia.org/wiki/Nao_(robot)
Figure 4.1: Aldebaran's humanoid robot NAO 1
This microphone self-noise is further on referred to as static noise.
Figure 4.2: Four microphones are located at NAO's head 2
The robot's processor is located in its head, too. It is cooled by a fan located between microphones C and D, marked by position E. The fan creates audio noise as well, further on referred to as fan noise.
Besides the static and fan noise, the robot creates noise through its movement. For the
movement the robot has a lot of degrees of freedom, as it can be seen in figure 4.3. For
this thesis two movements are selected as examples, the noise created by NAO waving
his right arm and the noise of NAO walking on the spot. Furthermore, as it proves to
be a challenging enough task, only one movement noise is tackled at a time.
2 http://doc.aldebaran.com/2-1/family/robots/microphone_robot.html
This leads to the two scenarios where the robot creates either walking noise or waving noise.
In the next sections the four noises described above are analyzed. The main focus is hereby on whether the noises fulfill criteria relevant for their representation by NMF. Other assumptions such as Gaussianity or statistical independence, which were made for the derivation of the estimation by the Wiener filter, are not investigated, because they are not essential for the NMF model.
Figure 4.3: The degrees of freedom for NAO’s movement3
4.1.1 Fan and static noise
As the static noise is always present, fan and static noise are analyzed jointly in this chapter. A short time power spectrogram with a frame size of 1024 samples is shown in figure 4.4. As can be seen, the spectrum changes very little over time. The power in the STFT domain can thus be assumed to be stationary. One result of this assumption is that an average over all frames is a suitable representation for these noises. Figure 4.5 shows the average power of the DFT frequency bins calculated for each microphone. As can be seen in the upper plot, most of the noise is at frequencies below three kHz. The lower plot shows that the spikes created by the fan noise are roughly at the same position for all microphones.
3 https://jv.wikipedia.org/wiki/Nao_(robot)
Figure 4.4: Short time power spectrogram of fan and static noise for all microphones
Their intensity, however, varies between the microphones by up to several tens of dB.
Furthermore, in order to get an idea of what part of the noise in figure 4.5 is created by the static microphone noise, the average power of the static noise alone is compared to it in figure 4.6.
Figure 4.5: Average power over frequencies of fan and static noise for all microphones
As can be seen, the static noise is clearly below the fan noise for all frequencies. Even after removing the fan spikes, the remaining broadband fan noise is significantly higher than the static noise. Nevertheless, if the fan is turned off, the static noise can be heard in speech pauses. Static and fan noise of the NAO have already been analyzed in the frequency domain with a similar frame length in [LBD+14].
Figure 4.6: Energy in fan and static noise that is actually created by static microphone noise
We propose a new approach for the analysis of the fan noise by exploiting its approximate stationarity. By significantly increasing the DFT length (in this example to one second), the average power of the fan in the frequency domain reveals some quite surprising results that were not observable before. The frequency range from 0 to 3 kHz, as shown in the upper plot of figure 4.7, shows far more spikes compared to the result from a 1024-point DFT in figure 4.5. By zooming into a certain frequency band (lower plot of figure 4.7) it can be seen that an individual spike of the fan noise is very narrowband, in most cases only a few Hertz wide. Obviously, this frequency sparsity cannot be exploited by Wiener filtering in the STFT domain if the frequency spacing is larger than the bandwidth of the fan spikes.
Furthermore, we propose a new model for the fan noise by transforming a single spike of a microphone back into the time domain. If this is for example performed for the frequency band from 480 Hz to 500 Hz, the time domain signal is shown in figure 4.8. In the upper plot it is obvious that the fan spike is slightly amplitude modulated. By adding a sinusoid as reference, which matches the spike signal at the beginning and the end of the frame (e.g. by counting zero crossings), it can be shown that the spike is also slightly frequency modulated (lower middle plot).
A suitable model for the fan noise could therefore be
$$X_{\mathrm{fan},t} = \sum_{f\in F_{\mathrm{fan}}} a_{ft}\sin\left(2\pi(f + b_{ft})t\right) \qquad (4.1)$$
Figure 4.7: Average power of fan and static noise for all microphones by a 1 second DFT
where $a_{ft}$ and $b_{ft}$ are modeled as random processes with statistical dependencies over time and possibly also over frequencies. Even a statistical dependency between amplitude and frequency is possible.
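A minimal synthesis sketch of this model is given below. Modeling a_ft and b_ft as low-pass filtered white noise and realizing the frequency modulation by integrating the instantaneous frequency (instead of inserting b_ft directly into the argument of eq. 4.1) are assumptions made here for illustration only; the thesis only states that the processes are statistically dependent over time.

```python
import numpy as np

def synthesize_fan_noise(spike_freqs, fs=16000, duration=1.0, am_depth=0.2,
                         fm_dev=1.0, mod_bw=2.0, rng=None):
    """Synthesize the fan-noise model of eq. 4.1 with slowly varying a_ft and b_ft."""
    rng = np.random.default_rng(rng)
    t = np.arange(int(fs * duration)) / fs
    alpha = np.exp(-2 * np.pi * mod_bw / fs)           # one-pole low-pass coefficient

    def slow_noise():
        w = rng.standard_normal(t.size)
        y = np.empty_like(w)
        acc = 0.0
        for n, wn in enumerate(w):                      # low-pass filtered white noise
            acc = alpha * acc + (1 - alpha) * wn
            y[n] = acc
        return y / (np.abs(y).max() + 1e-12)

    x = np.zeros_like(t)
    for f in spike_freqs:
        a = 1.0 + am_depth * slow_noise()               # amplitude modulation a_ft
        b = fm_dev * slow_noise()                       # frequency modulation b_ft [Hz]
        phase = 2 * np.pi * np.cumsum(f + b) / fs       # integrate instantaneous frequency
        x += a * np.sin(phase)
    return x

# e.g. 21 spike frequencies between 0 and 3 kHz
noise = synthesize_fan_noise(spike_freqs=np.linspace(200, 2800, 21))
```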
Figure 4.8: single fan spike in the time domain with reference sinusoids (bottom plots)
4.1.2 Movement noise
As mentioned earlier, the movement noise can only be observed with additive fan and static noise. For this reason, this analysis and later the algorithms are not performed on the direct recordings, but on prefiltered versions. The applied prefiltering techniques are described in chapter 4.2.
Walking
The walking noise, which NAO creates by walking on the spot, is discussed as the first movement noise. Figure 4.9 displays the short time power, using a frame length of 512 samples. The walking noise clearly shows a structure in the time and frequency domain. In the time domain, each excitation lasts less than a second, and the excitations show a similar frequency characteristic. The microphones in the top row look quite similar, whereas the microphones in the bottom row are clearly different.
Figure 4.9: Short time power spectrogram of walking noise for all microphones
Furthermore, it is of interest how well the power in the time-frequency domain can be represented by a low rank matrix, or in other words, whether the NMF model is reasonable. The NMF residual error using the Euclidean cost function is displayed in figure 4.10 for the four microphones. The error is normalized by the total power of the matrix to be factorized, such that the normalized residual error is 0 dB for a codebook size of zero.
As it can be seen, the residual can be decreased for three of the microphones by more
than 10 dB with less than 10 codewords. For 15 dB reduction, very large codebook
sizes of up to 60 are required.
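A sketch of the evaluation behind these curves is given below, using Euclidean multiplicative updates and normalizing the residual by the total power of the factorized matrix; the exact NMF variant, iteration count and normalization used for the figures may differ, so this is an illustrative reconstruction rather than the thesis code.

```python
import numpy as np

def nmf_residual_curve(P, codebook_sizes, n_iter=200, rng=0):
    """Normalized NMF residual error (in dB) of a power spectrogram P versus codebook size,
    using Euclidean multiplicative updates (eqs. 3.20/3.21 with beta = 1)."""
    eps = 1e-12
    curve = []
    for K in codebook_sizes:
        g = np.random.default_rng(rng)
        D = g.random((P.shape[0], K)) + eps
        A = g.random((K, P.shape[1])) + eps
        for _ in range(n_iter):
            A *= (D.T @ P) / (D.T @ D @ A + eps)
            D *= (P @ A.T) / (D @ (A @ A.T) + eps)
        residual = np.sum((P - D @ A) ** 2) / np.sum(P ** 2)   # 0 dB for K = 0
        curve.append(10 * np.log10(residual + eps))
    return curve
```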
Figure 4.10: Residual error for NMF approximation of walking power in time-frequency domain
Waving
The waving noise also shows a structure in the time and frequency domain, as shown in figure 4.11. Compared to the walking noise, the excitation over time is, at several seconds, rather long. Also the frequencies above 3 kHz are more distorted than for the walking noise. The individual peaks show some harmonic structure, which is not constant but has the highest frequencies in the middle of the excitation.
The representation of the short time power by a low rank matrix is very similar to the walking noise (see figure 4.12). With the first 10 codewords, a residual error below -10 dB can be achieved for three out of the four microphones. Compared to the walking noise, the residual error is roughly 5 dB lower for large codebooks.
4.1.3 Speech
Besides the movement noise, it is also important whether the short time power of the speech can be represented by a low rank matrix.
Figure 4.11: Short time power spectrogram of waving noise for all microphones
Figure 4.12: Residual error for NMF approximation of waving power in time-frequency domain
As shown in figure 4.13, the speech has similarly short excitations as the walking noise and shows quite some harmonic structure.
Figure 4.13: Short time power spectrogram of speech for all microphones
Figure 4.14 shows that in order to represent the speech power with a residual of -10 dB, more than 20 codewords are required. A reduction by 15 dB cannot be achieved for all microphones, regardless of the codebook size. It has to be noted, however, that for large codebooks the local optimality of the NMF algorithm is decisive. Obviously, it is possible to represent 129 frequencies by 129 codewords without error, for example by only picking codewords with one frequency each.
Figure 4.14: Residual error for NMF approximation of speech power in time-frequency domain
The average power of the two movement noises and the speech can be compared, too (see figure 4.15). In the band from 200 Hz to 1 kHz, where the speech has most of its power, the speech-to-interference ratio (SIR) lies between -3 and +13 dB. For lower frequencies the walking SIR becomes very poor, as does the waving SIR between 1 and 2 kHz.
4.2 Reduction of stationary components as preprocessing
In chapter 4.1, two stationary noise sources were identified, the fan noise and the static
noise. Due to the stationarity, there exist efficient algorithms to reduce these noises
in a pre- or postprocessing step. At this point, however, the question arises whether a separate treatment of the stationary noise types is necessary.
Figure 4.15: Average power (over time and microphones) of movement noise and speech over frequency
In chapter 2.5.2 a model was defined that can cope with spatially white noise, such as the static noise. It was also pointed out at the end of chapter 2.5.1 that a stationary Gaussian process can be modeled in the STFT domain by a source with a single component and unity activation.
Nevertheless, both noise types are reduced in a preprocessing step in this work. The main reason is that it is not possible to obtain reference signals for training without fan and static noise (with the exception of a speech training signal where the fan was turned off). One could still record a reference with static and fan noise and train those static noise parameters beforehand. Then one could train the other sources while incorporating the additional static noise on the reference files with the pretrained parameters. It is however questionable whether this additional effort results in sufficient benefit compared to a reduction by preprocessing.
Probably the most common algorithm for such a problem is a multichannel Wiener filter. In [LBD+14] a speech distortion weighted multichannel Wiener filter (see [SMW04]) is proposed for the NAO robot, for both the static and the fan noise. In this thesis, however, the static noise reduction is used in a preprocessing step. In order to avoid losing spatial information for later processing, it is essential to apply the exact same filter to all channels.
4.2.1 Exploiting the high sparsity in the frequency domain to reduce fan spikes
Instead of applying a single channel Wiener filter for static and fan noise, a different
approach is proposed in this thesis. Recall the properties of the fan noise described in
chapter 4.1.1. It is composed of 20-50 frequencies with an average bandwidth of 5-10
Hz. The exact numbers strongly depend on the definition of what is counted as a spike
of the fan noise and what is not.
In general a Wiener filter is very well suited for sparse signals in the frequency domain. However, to fully exploit the fact that the bandwidth of the active components is only a few Hertz, we aim at a frequency spacing of about 1 Hz (and will assume it is 1 Hz further on). Hence a DFT of approximately length $N_{DFT} = \frac{F_s}{1\,\mathrm{Hz}}$ has to be used, where $F_s$ denotes the sample rate. The idea is now to determine and attenuate all frequencies where the fan noise can be assumed to be a lot stronger than the speech. To do so, we propose the following approach:
• First, the mean energy of the fan noise is calculated over all channels and frames (blue line in figure 4.16).
• Second, all frequency bins are extracted which are 20 dB above the local average. We calculate the local average by convolution with a (rectangular) window of size 500 and proper scaling.
• Third, the determined frequencies are reduced to actual spikes: we define as spikes all frequencies which are a local maximum within a bandwidth of 20 Hz to both sides. In order to find the true center of the peak positions, the search can be applied on a smoothed version of the power spectrum (e.g. by convolution with a 3 Hz rectangular window, as depicted by the cyan line in figure 4.16).
• Last, the bandwidth of the fan noise has to be determined for each spike. As bandwidth, for example, the minimum distance to the center frequency can be used at which the energy (we again use the curve smoothed by a 3 Hz rectangular window) drops below a certain threshold. For this report the threshold (green curve in figure 4.16) was set to 7 times the energy of the background (static) noise in this frequency band. We approximate the background noise power by the median of the energy in a 40 Hz band. Due to the median operator, the static noise level estimate is less affected by the fan noise peak. A code sketch of this detection procedure is given below.
Figure 4.16: Fan frequency detection algorithm
Using the above specified parameters, the algorithm detects 21 frequencies for the fan noise with an average bandwidth of 6 Hz. Compared to the total number of $\frac{F_s}{2} = 8000$ frequency bins (at 1 Hz spacing), 1.6% of the frequencies are corrupted by the fan noise spikes.
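The sketch below implements the detection procedure described above on a 1 Hz frequency grid. Parameter names and the exact smoothing/threshold implementation are assumptions chosen to match the quoted values (20 dB threshold, 500 Hz local average, 3 Hz smoothing, 20 Hz peak spacing, factor 7, 40 Hz median), not a verbatim reproduction of the thesis code.

```python
import numpy as np

def detect_fan_spikes(avg_power, rel_thresh_db=20.0, local_win=500,
                      smooth_win=3, peak_dist=20, bw_factor=7.0, noise_win=40):
    """Return (center_frequency, bandwidth) pairs of fan spikes from an average power
    spectrum sampled at 1 Hz spacing (mean over frames and microphones)."""
    p = np.asarray(avg_power, float)
    box = lambda n: np.ones(n) / n
    local_avg = np.convolve(p, box(local_win), mode='same')   # step 2: local average
    smooth = np.convolve(p, box(smooth_win), mode='same')     # 3 Hz smoothed spectrum
    candidates = np.where(p > local_avg * 10 ** (rel_thresh_db / 10))[0]

    spikes = []
    for f in candidates:
        lo, hi = max(0, f - peak_dist), min(p.size, f + peak_dist + 1)
        if smooth[f] < smooth[lo:hi].max():
            continue                                           # step 3: keep only local maxima
        nlo, nhi = max(0, f - noise_win // 2), min(p.size, f + noise_win // 2 + 1)
        thresh = bw_factor * np.median(p[nlo:nhi])             # step 4: static-noise threshold
        bw = 0
        while (f - bw - 1 >= 0 and f + bw + 1 < p.size
               and max(smooth[f - bw - 1], smooth[f + bw + 1]) > thresh):
            bw += 1
        spikes.append((f, 2 * bw + 1))
    return spikes
```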
With this knowledge there exist several possibilities to get rid of the fan noise. The power of the determined frequency bins can be
• set to zero.
• set to a fixed low value (e.g. the threshold described earlier).
• set to an estimated value interpolated from the surrounding frequencies that are not distorted by the fan.
• reduced by the power the frequency bin had in the training phase (spectral subtraction).
In our experiments the results were rather similar for all methods described above, so further on the simplest version, where all fan frequency bins are set to zero, is applied.
The approach via the long DFT is not very suitable for real-time implementation, as either a delay of up to one second is introduced or many overlapping frames have to be used, such that the computational complexity increases rapidly.
The computational complexity can be reduced by taking into account that the fan
noise only significantly contributes to approximately 21 · 6 out of the NDFT = 16000
frequency bins. Thus only these frequencies have to be calculated. Then, a suitable
manipulation can be performed (or nothing is done if the frequency bins shall be set to
zero). Afterwards the fan frequency bins are transformed back into the time domain
and subtracted from the original signal.
The computational complexity or delay could, for example, further be decreased by
developing a scheme for recursive calculation of the desired DFT bins. Such an algorithm would then be closely related to frequency tracking algorithms (see for example [VZ12]).
If the frequency bins are set to zero, this approach can equivalently be seen as applying a
filter with 21 stopbands of varying bandwidth. An alternative approach could therefore
also be to design suitable (IIR) filters for the given scenario.
As the focus of this work is on evaluating if NMF is well suited for (nonstationary)
ego-noise reduction, the straightforward implementation via a long DFT was used in
this thesis.
4.2.2 Static noise reduction by single channel Wiener filter
The static noise and the residual of the fan noise are afterwards suppressed by a Wiener filter with pretrained noise power $V_{N_{\mathrm{static}},fr}$:
$$H_{WF,fr} = \max\left[0,\ \left(1 - \frac{V_{N_{\mathrm{static}},fr}}{\frac{1}{U}\sum_u |Y_{ufr}|^2}\right)\right] \qquad (4.2)$$
The estimate of the signal without static noise is then given by $\hat Y_{ufr} = H_{WF,fr}Y_{ufr}$.
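A minimal sketch of this preprocessing step, applying the identical gain of equation 4.2 to all channels so that the spatial information is preserved, could look as follows; the function name and array layout are illustrative.

```python
import numpy as np

def static_noise_wiener(Y, V_static, eps=1e-12):
    """Apply the single-channel Wiener gain of eq. 4.2 identically to all channels.

    Y        : U x F x R complex STFT of the microphone signals.
    V_static : F x R pretrained power of the static (plus residual fan) noise.
    """
    mean_power = np.mean(np.abs(Y) ** 2, axis=0)               # (1/U) sum_u |Y_ufr|^2
    H = np.maximum(0.0, 1.0 - V_static / (mean_power + eps))   # eq. 4.2
    return H[None, :, :] * Y                                   # same gain on every channel
```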
4.3 Reduction of non-stationary ego-noise
After removing the stationary parts, the non-stationary suppression technique can be
trained with less interference from the stationary components.
In the literature review, two multichannel NMF algorithms were highlighted. In chapter 3.3.1, MU rules for multiple mixtures with the Itakura-Saito divergence were presented; we will generalize this algorithm in chapter 4.4. The second multichannel algorithm reviewed was EM based. As discussed in chapter 4.5, its computational complexity reduces its usefulness for the given problem. In the same chapter we will thus propose a computationally inexpensive approach with a multichannel prefilter.
Yet one important part of any NMF method has not been discussed so far: the pretraining of NMF methods. Especially in the first years of NMF, this topic received little attention. In the following chapter the most common NMF training approach is reviewed and an interpretation is given of why it is not well suited for less sparse signals.
4.3.1 Informed source separation by pretraining of the dictio-
nary
In chapter 2.6 it was shown that blindness leads to a permutation problem and that model based source separation approaches can also suffer from permutation. As presented in appendix C.1, there is no established term in the literature for source separation using reference signals. In this work, reference based source separation algorithms will be termed informed or guided.
The first approach to incorporate training data in the NMF algorithms was to initialize
(and fix) the dictionary of each source by a clean recording similar to that source (see
e.g. [Sma07],[OF10],[SFM+14]). Training observations are further on indicated by an
additional index TR, for example the j-th source of the u-th mixture as YTR,ujfr in the
time frequency domain.
Training can then be performed by the same algorithm as the source separation later. Solving the NMF problem with observations $Y_1$ is written as $\operatorname{argmin} C_{NMF}(Y_1)$, such that the training is represented as
$$\{G_{TR,j}, D_{TR,j}, A_{TR,j}\} = \operatorname*{argmin}_{G,D,A}\ C_{NMF}(Y_{TR,j})\quad\forall j \qquad (4.3)$$
Afterwards a unified dictionary is created by concatenating the dictionaries of all sources for the test scenario:
$$D_{TE} = [D_{TR,1},\ldots,D_{TR,j},\ldots,D_{TR,J}] \qquad (4.4)$$
Usually the dictionary is then fixed for the testing phase, such that the testing optimization simplifies to
$$\{G_{TE}, A_{TE}\} = \operatorname*{argmin}_{G,A}\ C_{NMF}(Y_{TE}) \qquad (4.5)$$
If the NMF algorithm is for example applied to the time-frequency domain variance, the variance of the $j$-th source is estimated by extracting the activation vectors corresponding to the source's components out of the complete activation matrix. The estimate is then calculated as $\hat X_{TE,j} = D_{TR,j}A_{TE,j}$.
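The following sketch illustrates this train/concatenate/fix workflow of equations 4.3–4.5 for a single channel. Euclidean NMF is used here purely for brevity, and all function and variable names are illustrative rather than taken from the thesis.

```python
import numpy as np

def euclidean_mu(X, D, A, n_iter=200, update_D=True, eps=1e-12):
    """Euclidean MU updates (eqs. 3.20/3.21 with beta = 1); D can be kept fixed."""
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        if update_D:
            D *= (X @ A.T) / (D @ (A @ A.T) + eps)
    return D, A

def informed_separation(Y_train, Y_test, K, rng=0, eps=1e-12):
    """Train one dictionary per clean source (eq. 4.3), concatenate them (eq. 4.4),
    then fit only the activations on the test mixture (eq. 4.5)."""
    g = np.random.default_rng(rng)
    dicts = []
    for Yj in Y_train:                                      # eq. 4.3: one NMF per source
        F, R = Yj.shape
        Dj, _ = euclidean_mu(Yj, g.random((F, K)) + eps, g.random((K, R)) + eps)
        dicts.append(Dj)
    D_te = np.concatenate(dicts, axis=1)                    # eq. 4.4: unified dictionary
    F, R = Y_test.shape
    _, A_te = euclidean_mu(Y_test, D_te, g.random((D_te.shape[1], R)) + eps,
                           update_D=False)                  # eq. 4.5: dictionary fixed
    # per-source estimates: X_hat_j = D_TR,j A_TE,j
    return [dicts[j] @ A_te[j * K:(j + 1) * K] for j in range(len(dicts))]
```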
However, other combinations can be interesting as well for audio source separation. If
the mixing matrix is approximately identical between training and testing, it can be
initialized by the training data and fixed for testing.
If there exists a reference with a similar time structure (in [SLVB15] music source
separation is discussed, where cover songs are used as references) it may be beneficial
to pretrain the activation matrices.
Of course it is also possible to initialize a parameter by its reference, but still update
it in the test optimization. This can be interpreted as exploiting the local optimality
of the NMF algorithms. If this initialization is for example applied for the dictionary,
one could describe this approach as finding a similar dictionary for that source, which
is more suitable for the new data. Due to the local optimality it is likely, that the
codewords in the j-th source dictionary still represent the j-th source and not any
other source.
4.3.2 Approximate additivity assumption
The training approach described in the last chapter, however, has a major drawback. In accordance with [SFM+14] we will term it the approximate additivity assumption. Consider a two-source single-channel scenario, where dictionary and activation matrix are trained such that
$$Y_{TR,j} = D_{TR,j}A_{TR,j} \qquad (4.6)$$
For simplicity the assumption is made that the training observations have the same rank as dictionary and activation matrix (noiseless case) and that the global optimum is found, such that $Y_{TR,j} = \hat Y_{TR,j}$.
If the dictionary is trained and fixed, the activation matrix can be expressed by using an activation function $f_A$ as $A_{TR,j} = f_A(Y_{TR,j}, D_{TR,j})$. The function $f_A$ depends on the NMF algorithm (e.g. the used divergence) as well as the initialization. In general, the output of the activation function has some randomness due to the random initialization. In the training, a first criterion can thus be formulated as
$$Y_{TR,j} = \hat Y_{TR,j} = D_{TR,j}\,f_A(Y_{TR,j}, D_{TR,j}) \qquad (4.7)$$
In the described training scenario it is essential that the activation matrices can also be extracted from the observations of the mixture, given the trained dictionaries. This differs from the training step, as the observations depend on both noise and speech signals. In mathematical terms we express this dependency by the mixing function $f_M$ as $Y_{TR} = f_M(Y_{N_{TR}}, Y_{X_{TR}})$. If an algorithm cannot obtain the activation matrices from mixed training observations, it is unlikely that the algorithm can do so for the actual test observations.
If the observations represented the variance and speech and noise were statistically independent, the speech and noise variances would add up. Yet in our model the observations represent the instantaneous power; the observations thus do not simply add up (they only do so on average). Even worse, the mixing function is not deterministic regarding the speech and the noise. Take, for example, the magnitude squared of the sum of two complex values. The result obviously depends on the amplitudes and the phase difference of the two values. Yet as only the magnitude squared is observable, there exists an (infinite) number of complex speech and noise samples that lead to the same observation.
Regardless of the mixing function, the new estimate can be expressed as
$$\hat Y_{TR} = D_{TR,1}\hat A_{TR,1} + D_{TR,2}\hat A_{TR,2} \qquad (4.8)$$
where $\hat A_{TR,j}$ denotes the activations obtained from the mixture. (Perfect) separation/reconstruction is achieved if $\hat A_{TR,1} \overset{!}{=} A_{TR,1}$ and $\hat A_{TR,2} \overset{!}{=} A_{TR,2}$.
At this point [SFM+14] claims that the activation matrices are approximately equivalent if the dictionaries are sufficiently different and the elements of the mixture are therefore modeled by only one dictionary. In the following we try to give a mathematical background for this statement. The perfect reconstruction criterion for source 1 can be rewritten as
$$f_A(Y_{TR}, D_{TR,1}) = \hat A_{TR,1} \overset{!}{=} A_{TR,1} = f_A(Y_{TR,1}, D_{TR,1}) \qquad (4.9)$$
If the following four assumptions hold, perfect reconstruction can be achieved:
• The initialization is perfect: the activation matrices are initialized in such a way that $\hat Y_{TR} = Y_{TR}$.
• Observations are additive: $f_M(Y_{N_{TR}}, Y_{X_{TR}}) = Y_{N_{TR}} + Y_{X_{TR}}$.
• The activation function is additive with respect to the observations: $f_A(Y_{TR,1} + Y_{TR,2}, D_{TR,1}) = f_A(Y_{TR,1}, D_{TR,1}) + f_A(Y_{TR,2}, D_{TR,1})$.
• The sources are separable: $f_A(Y_{TR,2}, D_{TR,1}) \overset{!}{=} 0$.
For example, take the very simple linear activation function $f_A(Y_{TR}, D_{TR,1}) = Y_{TR}^T D_{TR,1}$. Obviously the function is additive regarding the observations, and the assumption that the first codebook is unable to represent the observations of the second source simplifies to an uncorrelatedness constraint: $Y_{TR,2}^T D_{TR,1} \overset{!}{=} 0$. Each codeword has to be orthogonal to all observation vectors over time. As both codewords and observations are nonnegative, this is only given if for each frequency either all codewords of source 1 are zero or all observation frames of source 2 are zero.
On the other hand, perfect reconstruction is not necessarily impossible if one of the four above assumptions is not fulfilled, but the impact of the second source on the first source's estimate is then unknown in general.
Recently, some new approaches to reference based source separation have been proposed that can take the additivity requirement into account. A brief introduction to these approaches is given in chapter 6.
Further on, it is investigated, whether it is important to train and use dictionaries
for all sources. In the context of robot ego-noise reduction, the sources reduce to one
source for the desired speech signal and one source for the undesired ego-noise. In
the same context of robot ego-noise reduction, [DK14] only uses a dictionary for the
noise. This approach is also possible for the NMF algorithms described in this work, as
the power of the untrained source is approximately the observed power of the mixture
minus the estimated power of the trained source.
In the next section we try to give insights into using NMF with only a noise codebook. Afterwards the approach is compared to the same scenario with a speech and a noise codebook. This also gives some more insight into the approximate additivity requirement discussed in this chapter, by investigating its impact on an actual algorithm.
4.3.3 Euclidean MF with a single dictionary
For simplicity, the case of Euclidean MF is considered. The nonnegativity constraint is neglected for simpler mathematical derivations. The training noise observations are denoted as $Y_{N_{TR}}$, such that the training optimization problem is given by
$$[A_{N_{TR}}, D_N] = \operatorname*{argmin}_{A,D}\ C_{EUC}(Y_{N_{TR}}, DA) = \operatorname*{argmin}_{A,D}\ \|Y_{N_{TR}} - DA\|_F^2 \qquad (4.10)$$
As in NMF, the dictionary and activation matrix are updated iteratively, and the algorithm converges at some point. By setting the gradient to zero, the following equations hold for the (locally) optimal parameters:
$$D_N = Y_{N_{TR}}A_{N_{TR}}^T\left(A_{N_{TR}}A_{N_{TR}}^T\right)^{-1} \qquad (4.11)$$
$$A_{N_{TR}} = f_A(Y_{N_{TR}}, D_N) = \left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.12)$$
For a given dictionary we can thus form the estimate resulting from Euclidean matrix factorization as
$$\hat Y_{N_{TR}} = D_NA_{N_{TR}} = D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.13)$$
In equation 4.13, two observations can be made for the training estimator. The estimator is
• linear.
• of rank K.
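Both properties can be checked numerically: with the nonnegativity constraint dropped, alternating the closed-form updates 4.11 and 4.12 yields the orthogonal projection of equation 4.13, which is linear and of rank K. The following small sketch with arbitrary random data illustrates this; it is a didactic check, not part of the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
F, R, K = 8, 50, 3
Y_ntr = rng.random((F, R))                       # noise training observations

A = rng.random((K, R))
for _ in range(100):                             # alternating least squares (eqs. 4.11/4.12)
    D_n = Y_ntr @ A.T @ np.linalg.inv(A @ A.T)
    A = np.linalg.inv(D_n.T @ D_n) @ D_n.T @ Y_ntr

P = D_n @ np.linalg.inv(D_n.T @ D_n) @ D_n.T     # projection matrix of eq. 4.13
Y_hat = P @ Y_ntr
print(np.linalg.matrix_rank(P))                  # K  -> the estimator has rank K
print(np.allclose(P @ P, P))                     # idempotent, i.e. a linear projection
```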
The perfect training criterion from equation 4.7 is now given as
$$Y_{N_{TR}} \overset{!}{=} D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.14)$$
or
$$\left(I_{F\times F} - D_N\left(D_N^TD_N\right)^{-1}D_N^T\right)Y_{N_{TR}} \overset{!}{=} 0 \qquad (4.15)$$
and the perfect reconstruction criterion of equation 4.9 leads to
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{TR} \overset{!}{=} \left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.16)$$
by inserting the activation function. Using the mixing function $Y_{TR} = f_M(Y_{X_{TR}}, Y_{N_{TR}})$ it rewrites to
$$\left(D_N^TD_N\right)^{-1}D_N^T\left(f_M(Y_{X_{TR}}, Y_{N_{TR}}) - Y_{N_{TR}}\right) \overset{!}{=} 0 \qquad (4.17)$$
If the mixing is linear, perfect reconstruction simplifies to
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TR}} \overset{!}{=} 0 \qquad (4.18)$$
Equation 4.18 can be interpreted as follows: the best estimate is achieved if the noise dictionary is not able to represent the speech observations.
However, at no point in the training process was the impact of the speech on the training dictionary considered. This is equivalent to assuming that the corrupting signal (here the speech) is white, as white noise can be represented equally well by any codebook.
Obviously, an estimator trained under this assumption will lose performance when corrupted by a signal more similar to the desired observations, and will gain performance when corrupted by a signal less similar to the desired observations.
In the next chapter, it is shown, that the problem identified above does not vanish by
simply training a codebook for speech and noise.
4.3.4 Euclidean MF with two dictionaries
A codebook for speech and noise can simply be used by modifying equation 4.10 into
$$[A_{N_{TR}}, A_{X_{TR}}] = \operatorname*{argmin}_{A_N,A_X}\ \|Y_{TR} - D_NA_N - D_XA_X\|_F^2 \qquad (4.19)$$
Similarly to equations 4.15 and 4.17, good estimates are achieved if the following equations are fulfilled:
$$\left(I_{F\times F} - D_N\left(D_N^TD_N\right)^{-1}D_N^T\right)Y_{N_{TR}} \overset{!}{=} 0 \qquad (4.20)$$
$$\left(I_{F\times F} - D_X\left(D_X^TD_X\right)^{-1}D_X^T\right)Y_{X_{TR}} \overset{!}{=} 0 \qquad (4.21)$$
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TE}} \overset{!}{=} 0 \qquad (4.22)$$
$$\left(D_X^TD_X\right)^{-1}D_X^TY_{N_{TE}} \overset{!}{=} 0 \qquad (4.23)$$
The first two equations ensure that the codebooks are able to represent the data; they can be fulfilled by perfect initialization and a suitable codebook size.
The last two equations make sure that the noise dictionary is unable to model the speech observations and, in addition, that the speech dictionary is unable to model the noise observations. But again, none of this was considered in the training of the dictionaries in previous NMF approaches.
This is again the property which was pointed out in [SFM+14] by stating that separate pretraining of the dictionaries is applicable if the dictionaries are sufficiently different, so that the elements of the mixture are modeled by only one dictionary. Yet the authors did not discuss the consequences if the assumptions above are not fulfilled. Thus we will do so in the following:
For simplicity, a scenario is assumed in which the speech and the noise dictionary both contain an identical codeword. Both codewords can thus represent the exact same data in the frequency domain. Regarding the cost function it does, however, not matter which of the codewords is used; the cost stays the same. The algorithm therefore arbitrarily distributes the optimum activation onto the two codewords.
If equation 4.20 is fulfilled, the estimate of the noise parameters obviously does not benefit from using the speech codebook, and vice versa. If the equation is not fulfilled, the algorithm arbitrarily distributes the joint parts of the spectrum onto the speech and noise dictionaries. This arbitrary distribution is an undesired property.
In comparison with the single dictionary, we can thus conclude that if two sources are easily separable, it is not essential whether one or both sources are trained. For two similar sources, neither of the training strategies is suitable.
Solving the arbitrary behavior by constraints
One interesting approach to resolve this arbitrary behavior is the use of constraints. Many works have shown that sparsity or temporal continuity constraints improve the separation results. Sparsity can for example be interpreted as follows: energy that can be arbitrarily shifted between two codewords is now shifted (arbitrarily) to one of the codewords in order to get a sparse representation. If the codewords are not identical but can, for example, both model a certain frequency, a sparsity constraint would model this frequency by codewords that are nonzero anyway because they also model a part the other codeword cannot model.
For example, consider three codewords. One codeword only models frequency f1, the second models frequency f2, and both belong to the same source. The third codeword can model both frequencies with the same amplitude and belongs to another source. Assume now that the two frequencies are observed with the same power. Without sparsity, the observations can be modeled by the equal combination of the first two codewords, by the third codeword, or by any combination of them. But if nonzero activations are penalized, the observations will be modeled by the third codeword only.
In a similar manner, a temporal continuity constraint resolves the ambiguity by choosing the ratio of the two activation coefficients that is best suited in relation to the neighboring frames. By constraining the activation vectors with an L2 norm or higher, one could make sure that as many codewords as possible are used.
At that point it is interesting to note that in [SFM+14], the authors apply source
separation by NMF with identical codebooks by exploiting the temporal statistics of
the sources.
In this report we try to solve the problem described above by investigating two different approaches: one for training only a noise dictionary, at the end of this chapter, and one for a speech and a noise dictionary, as described in the next section.
Penalizing similar codewords
An orthogonality constraint can be applied to the speech and noise dictionaries in accordance with chapter 3.4.3. In order to do so, the training of the speech and noise dictionaries has to be performed jointly, for example by the following equation:
$$[D_N, A_{N_{TR}}, D_X, A_{X_{TR}}] = \operatorname*{argmin}_{D_N,A_N,D_X,A_X}\ \|Y_{N_{TR}} - D_NA_N\|_F^2 + \|Y_{X_{TR}} - D_XA_X\|_F^2 + \lambda\sum_{k_N,k_X} D_{N,k_N}^T D_{X,k_X} \qquad (4.24)$$
The orthogonality penalty in equation 4.24 is problematic in combination with the nonnegativity constraint of the dictionary: it actually penalizes all dictionary pairs which have some nonzero frequencies in common, which is obviously not desired. Nevertheless, one can argue that the constraint penalizes more strongly if two codewords are similar.
One could also think about different penalty functions for similar codewords. For example, one could penalize identical codewords by
$$[D_N, A_{N_{TR}}, D_X, A_{X_{TR}}] = \operatorname*{argmin}_{D_N,A_N,D_X,A_X}\ \|Y_{N_{TR}} - D_NA_N\|_F^2 + \|Y_{X_{TR}} - D_XA_X\|_F^2 + \lambda\sum_{k_N,k_X} \frac{1}{\|D_{N,k_N} - D_{X,k_X}\|_2^2} \qquad (4.25)$$
The usefulness of such penalty functions on similar codewords still has to be investigated.
The approach of jointly training all dictionaries is related to recent publications, which are briefly reviewed in chapter 6.
4.3.5 Dictionary training based on desired and undesired data
Recall the estimate after training a single dictionary given by equation 4.13 with the Euclidean distance:
$$\hat Y_{N_{TR}} = D_NA_{N_{TR}} = D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}}$$
The novel approach is now to train a dictionary which exactly matches the above estimator. This can be achieved by optimizing the following cost function:
$$[D_N] = \operatorname*{argmin}_{D_N}\ \|Y_{N_{TR}} - D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{TR}\|_F^2 \qquad (4.26)$$
If a good solution can be found, the trained dictionary has the following properties:
$$D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \approx Y_{N_{TR}} \qquad (4.27)$$
$$D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TR}} \approx 0 \qquad (4.28)$$
Interestingly, equation 4.28 coincides (up to a multiplication with $D_N$) with the earlier defined criterion for a good estimator in equation 4.18, and equation 4.27 coincides with the criterion for perfect training of a dictionary in equation 4.15.
The algorithm selects codewords by maximizing the squared difference between the representation of the part to be estimated (equation 4.27) and the representation of the part to be suppressed (equation 4.28). Recall that in chapter 4.3.3 the noise codewords were simply selected by their ability to represent the part to be estimated.
The proposed algorithm will hence not choose a codeword that is able to represent the undesired part better than the desired part. It also favors a codeword that represents a medium amount of the part to be estimated and a small amount of the part to be suppressed over a codeword that represents both the part to be estimated and the part to be suppressed well.
The original algorithm simply selects the codewords according to how well they represent the desired components. It can thus be interpreted as assuming that the undesired part can be equally well represented by all possible codewords, which is equivalent to assuming distortion by white noise.
Solving the optimization problem through relaxation
Still the problem of optimizing the dictionary in equation 4.26 is not addressed. The gradient \frac{\partial}{\partial D_N} \|Y_{N_{TR}} - D_N (D_N^T D_N)^{-1} D_N^T Y_{TR}\|_F^2 is difficult to calculate analytically. For that reason the optimization problem is relaxed by introducing the matrix B = (D_N^T D_N)^{-1} D_N^T. The noise observations are now estimated as \hat{Y}_{N_{TR}} = D_N B Y_{TR}. The relaxed optimization problem thus writes as
[D_N, B] = \operatorname{argmin}_{D_N, B} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2   (4.29)
The matrices can then iteratively be updated by
\frac{\partial}{\partial D_N} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2 = (Y_{N_{TR}} - D_N B Y_{TR})(B Y_{TR})^T \overset{!}{=} 0   (4.30)
D_N = Y_{N_{TR}} Y_{TR}^T B^T (B Y_{TR} Y_{TR}^T B^T)^{-1}   (4.31)
and
\frac{\partial}{\partial B} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2 = D_N^T (Y_{N_{TR}} - D_N B Y_{TR}) Y_{TR}^T \overset{!}{=} 0   (4.32)
B = (D_N^T D_N)^{-1} D_N^T Y_{N_{TR}} Y_{TR}^T (Y_{TR} Y_{TR}^T)^{-1}   (4.33)
The above algorithm does not enforce a nonnegativity constraint on the matrices. At this point we no longer see a necessity to do so. Nevertheless, it is likely that multiplicative update rules can be derived to optimize equation 4.29 with a nonnegativity constraint.
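As an illustration, the alternating updates of equations 4.31 and 4.33 can be implemented as in the following minimal NumPy sketch. The function name, the random initialization and the small regularization terms are assumptions added for this sketch and are not part of the derivation.

```python
import numpy as np

def train_predictive_dictionary(Y_n, Y, K, n_iter=50, eps=1e-8, seed=0):
    """Alternating closed-form updates (4.31) and (4.33) for the relaxed problem (4.29).
    Y_n: noise training spectrogram (F x R), Y: training observations used as
    estimator input (F x R), K: number of codewords."""
    F, R = Y.shape
    rng = np.random.default_rng(seed)
    D = rng.random((F, K))          # dictionary D_N
    B = rng.random((K, F))          # relaxed matrix B, replacing (D^T D)^{-1} D^T
    for _ in range(n_iter):
        BY = B @ Y
        # eq. (4.31): D_N = Y_N Y^T B^T (B Y Y^T B^T)^{-1}
        D = Y_n @ BY.T @ np.linalg.inv(BY @ BY.T + eps * np.eye(K))
        # eq. (4.33): B = (D^T D)^{-1} D^T Y_N Y^T (Y Y^T)^{-1}
        B = (np.linalg.inv(D.T @ D + eps * np.eye(K)) @ D.T
             @ Y_n @ Y.T @ np.linalg.inv(Y @ Y.T + eps * np.eye(F)))
    return D, B

# The noise power of new observations Y_test is then predicted as D @ B @ Y_test.
# In the special case K = F (equation 4.34) the estimator collapses to the single
# linear map U = Y_n @ np.linalg.pinv(Y).
```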
The proposed method also has some drawbacks. First, it is based on the Euclidean distance and thus enforces a linear low-rank estimator. Second, it is based on the assumption that the speech and noise observations are combined by an additive mixing function f_M.
Finally, the special case where K = F is investigated. As D_N B \overset{\Delta}{=} U is now of full rank, the optimum estimator can straightaway be calculated via the pseudo-inverse as
U = Y_{N_{TR}} Y_{TR}^T (Y_{TR} Y_{TR}^T)^{-1} = Y_{N_{TR}} Y_{TR}^{+}   (4.34)
4.4 Multiplicative Update (MU) algorithm for multichannel data
In chapter 3.3.1, MU rules for the factorization of multichannel data by NMF with the Itakura-Saito divergence were reviewed. In this chapter a generalization of this algorithm for the α-β divergence is derived. The cost function is given by
C_{ML}(G, D, A) = \sum_u D_{\alpha,\beta}(X_u \| \hat{X}_u)   (4.35)
with the mixing model for the observations
\hat{X}_{u,fr} = \sum_j Q_{u,j,f} \sum_{k \in K_j} D_{fk} A_{kr}   (4.36)
The gradient for the new model is given by
\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = -\sum_{ufr} \hat{X}_{u,fr}^{\alpha+\beta-1} \frac{\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)^{\alpha} - 1}{\alpha} \frac{\partial \hat{X}_{u,fr}}{\partial \theta}   (4.37)
The MU rules for the new model can, similar to equation D.33, be derived as
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\frac{1}{A_{kr}^{\alpha}} \eta_{kr} \sum_{uf} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right) \frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(A_{kr})}\right)   (4.38)
Again the gradients of the transformed parameters have to be calculated for the new mixing model as
\frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(A_{kr})} = \frac{\partial}{\partial \ln_{1-\alpha}(A_{kr})} \sum_j Q_{u,j,f} \sum_{k \in K_j} D_{fk} \exp_{1-\alpha}(\ln_{1-\alpha}(A_{kr})) = Q_{u,j_k,f} D_{fk} A_{kr}^{1-\alpha}   (4.39)
where j_k denotes the source to which codeword k belongs. Inserted into equation 4.38, this rewrites as follows:
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\eta_{kr} \sum_{uf} Q_{u,j_k,f} D_{fk} A_{kr}^{1-2\alpha} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)\right)   (4.40)
Again, defining suitable step-sizes
\eta_{kr} = \frac{A_{kr}^{2\alpha-1}}{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1}}   (4.41)
the MU rule writes as
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\frac{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)}{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1}}\right)   (4.42)
For α ≠ 0 the MU rules can be rewritten as
A_j \leftarrow A_j \, . \left(\frac{\sum_u (\operatorname{diag}(Q_{u,j}) D_j)^T (\hat{X}_u^{\beta-1} . X_u^{\alpha})}{\sum_u (\operatorname{diag}(Q_{u,j}) D_j)^T \hat{X}_u^{\alpha+\beta-1}}\right)^{\frac{1}{\alpha}}   (4.43)
By using the step-size \eta_{fk} = \frac{D_{fk}^{2\alpha-1}}{\sum_{ur} Q_{u,j_k,f} A_{kr} \hat{X}_{u,fr}^{\alpha+\beta-1}}, the update rule for the dictionary can be calculated in a similar manner and writes as
D_j \leftarrow D_j \, . \left(\frac{\sum_u \operatorname{diag}(Q_{u,j}) (\hat{X}_u^{\beta-1} . X_u^{\alpha}) A_j^T}{\sum_u \operatorname{diag}(Q_{u,j}) \hat{X}_u^{\alpha+\beta-1} A_j^T}\right)^{\frac{1}{\alpha}}   (4.44)
In the multichannel case a third update rule has to be derived for Q_{u,j,f}. Using
\frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(Q_{u,j,f})} = Q_{u,j,f}^{1-\alpha} \sum_{k \in K_j} D_{fk} A_{kr} = Q_{u,j,f}^{1-\alpha} D_{j,f} A_{j,r}   (4.45)
the update rule is
Q_{u,j,f} \leftarrow Q_{u,j,f} \exp_{1-\alpha}\left(\eta_{u,j,f} Q_{u,j,f}^{1-2\alpha} \sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)\right)   (4.46)
Using the step-size \eta_{u,j,f} = \frac{1}{Q_{u,j,f}^{1-2\alpha} \sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1}}, the update rule writes as
Q_{u,j,f} \leftarrow Q_{u,j,f} \exp_{1-\alpha}\left(\frac{\sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)}{\sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1}}\right)   (4.47)
For α ≠ 0 the update rule simplifies to
Q_{u,j} \leftarrow Q_{u,j} \, . \left(\frac{((D_j A_j) . \hat{X}_u^{\beta-1} . X_u^{\alpha}) \mathbf{1}_{R \times 1}}{((D_j A_j) . \hat{X}_u^{\alpha+\beta-1}) \mathbf{1}_{R \times 1}}\right)^{\frac{1}{\alpha}}   (4.48)
By inserting α = 1 and β = −1, such that the α-β-divergence becomes the Itakura-
Saito divergence, the update rules are equivalent to the version from [OF10], which
were reviewed in chapter 3.3.1.
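To make the structure of the update rules 4.43, 4.44 and 4.48 concrete, the following is a minimal NumPy sketch for α ≠ 0. It is an illustrative implementation of these equations written for this chapter's notation, not the code used for the experiments; the random initialization and the small eps terms are assumptions added for numerical stability.

```python
import numpy as np

def ab_mu_multichannel(X, Ks, alpha=1.0, beta=-1.0, n_iter=100, eps=1e-12, seed=0):
    # X: nonnegative power spectrograms, shape (U, F, R) = (channels, freqs, frames)
    # Ks: number of codewords per source, e.g. [K_speech, K_noise]
    U, F, R = X.shape
    J = len(Ks)
    rng = np.random.default_rng(seed)
    D = [rng.random((F, K)) + eps for K in Ks]       # per-source dictionaries D_j
    A = [rng.random((K, R)) + eps for K in Ks]       # per-source activations A_j
    Q = rng.random((U, J, F)) + eps                  # spatial gains Q_{u,j,f}

    def model():
        # X_hat[u,f,r] = sum_j Q[u,j,f] * (D_j A_j)[f,r]   (eq. 4.36)
        Xh = np.zeros((U, F, R))
        for j in range(J):
            Xh += Q[:, j, :, None] * (D[j] @ A[j])[None, :, :]
        return np.maximum(Xh, eps)

    for _ in range(n_iter):
        for j in range(J):
            # activation update, eq. (4.43)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            QD = Q[:, j, :, None] * D[j][None, :, :]            # diag(Q_{u,j}) D_j
            A[j] *= (np.einsum('ufk,ufr->kr', QD, Num) /
                     (np.einsum('ufk,ufr->kr', QD, Den) + eps)) ** (1 / alpha)

            # dictionary update, eq. (4.44)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            D[j] *= (np.einsum('uf,ufr,kr->fk', Q[:, j, :], Num, A[j]) /
                     (np.einsum('uf,ufr,kr->fk', Q[:, j, :], Den, A[j]) + eps)) ** (1 / alpha)

            # spatial gain update, eq. (4.48)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            S = (D[j] @ A[j])[None, :, :]
            Q[:, j, :] *= ((S * Num).sum(axis=2) /
                           ((S * Den).sum(axis=2) + eps)) ** (1 / alpha)
    return D, A, Q
```

Setting alpha=1 and beta=-1 reproduces the Itakura-Saito special case mentioned above.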
4.5 The challenge of multiple mixtures
So far two algorithms were presented that can be applied to multichannel data. How-
ever, in the MU algorithm of chapter 4.4 the mutual information between the channels
is discarded.
Also, the EM algorithm described in chapter 3.3.2 has an undesirable property: its computational complexity. The most computationally expensive operation is the inversion of the mixture covariance matrix Σ_{Y,fr}, which is for example required to calculate equation 3.31. The dimension of Σ_{Y,fr} is M × M. If no advanced inversion algorithm is used, the computational complexity of this operation is O(M^3). It is problematic that the inversion has to be performed for each time-frequency bin and in each iteration. When I denotes the number of iterations, the overall computational complexity of the EM algorithm is given as
O(M^3 F R I)   (4.49)
In [OF10] the algorithm is used for binaural music separation (M = 2). In this work it shall however be applied to a robot with 4 microphones. According to equation 4.49, the shift from M = 2 to M = 4 results in an 8 times higher computational complexity. 2 × 2 matrices can easily be inverted for all time and frequency bins at once by elementwise matrix multiplication, division and addition. For 4 × 4 matrices a similar approach is also possible. However, instead of growing by a factor of 8, as suggested by equation 4.49, the number of multiplications increases from 4 to 280 and the number of additions/subtractions from 1 to 104, as sketched in appendix D.6.
The above comparison and experimental tests showed that in the given scenario the source separation cannot be performed within a few hours. Moreover, 4 microphones is not a particularly high number for a robot, and the algorithm's complexity increases rapidly for even more microphones.
As both methods considered so far have some drawbacks, an alternative approach is proposed in this thesis. In spatial filter design it is well known that a multichannel Wiener filter (MWF), which is the optimum linear MSE filter (see appendix C.0.5), can be separated into an MVDR beamformer and a single channel postfilter (see for example [BW01], chapter 3.2.2).
It is not guaranteed that the optimum filter resulting from the NMF model for the reduction of non-stationary noise can be separated into a multichannel and a single channel part. Nevertheless it is worth trying, for the following two reasons:
• reduced computational complexity, as the multichannel problem has to be solved only once and not in every iteration
• all existing single channel algorithms can be applied after the multichannel preprocessing
At this point some properties of the investigated ego-noise reduction scenario are introduced that do not hold in general for audio source separation. By restricting the scenario to 2 sources (ego-noise and speech), of which only the performance of the speech estimate is of importance, it is sufficient to design a single multichannel filter for the speech. Recall that a scenario is considered where microphone and loudspeaker positions do not change over time. The propagation vector is therefore approximately constant over time. This and the similarity of speech and noise between training and testing allow designing a fixed filter with the training data and then applying it during testing.
4.5.1 Multichannel Wiener filter (MWF)
As filter we propose to use the optimum spatial linear MSE estimator, often called the MWF, for the u-th spatial sound source of the speech training data, denoted as X_{S_{TR},ufr}:
\min_{H_f} E\{|X_{S_{TR},ufr} - H_f^H Y_{TR,fr}|^2\}   (4.50)
The algorithm is designed to estimate one of the spatial sound sources instead of the true source signals, as the true source signals are not always accessible for the training data. As derived in appendix C.0.5, the optimum filter is
H_f = E\{Y_{TR,fr} Y_{TR,fr}^H\}^{-1} E\{Y_{TR,fr} X_{S_{TR},ufr}^{*}\} = \Sigma_{YY,fr}^{-1} \Sigma_{Y X_S,fr}   (4.51)
As the observations Y_{TR,fr} = X_{S_{TR},fr} + X_{N_{TR},fr} add up (at this point the mixing is performed in the complex time-frequency domain, where two sources actually add up; hence the problem with the unknown mixing behavior described in chapter 4.3.2 does not arise) and speech and noise are uncorrelated, E\{X_{N_{TR},fr} X_{S_{TR},ufr}^{*}\} = 0, the equation simplifies to H_f = \Sigma_{YY,fr}^{-1} \Sigma_{X_S X_S,fr}. The spatial covariance matrices can then be estimated as time averages over the training data by
\Sigma_{YY,f} = \frac{1}{R} \sum_r Y_{TR,fr} Y_{TR,fr}^H   (4.52)
\Sigma_{X_S X_S,f} = \frac{1}{R} \sum_r X_{S_{TR},fr} X_{S_{TR},ufr}^{*} \overset{\Delta}{=} V_{S_{TR},uf}\, d_X   (4.53)
The vector d_X can be interpreted as the average propagation vector of the source relative to the selected microphone. The MWF then rewrites as
H_f = \Sigma_{YY,f}^{-1} V_{S_{TR},uf}\, d_X   (4.54)
and is used for all frames of the testing data.
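The fixed MWF of equation 4.54 can be computed directly from the training STFTs. The following NumPy sketch is illustrative (function and variable names are hypothetical); it estimates the covariances of equations 4.52 and 4.53 as time averages and solves one linear system per frequency bin. The small diagonal loading is an assumption added for numerical stability.

```python
import numpy as np

def fixed_mwf(Y_tr, Xs_tr, ref_mic=0, eps=1e-10):
    # Y_tr:  mixture training STFT,     shape (M, F, R), complex
    # Xs_tr: speech-only training STFT, shape (M, F, R), complex
    # returns H of shape (F, M): one fixed filter per frequency bin (eq. 4.54)
    M, F, R = Y_tr.shape
    H = np.zeros((F, M), dtype=complex)
    for f in range(F):
        Yf = Y_tr[:, f, :]                          # (M, R)
        Sigma_YY = (Yf @ Yf.conj().T) / R           # eq. (4.52)
        # cross-covariance with the reference-microphone speech image, eq. (4.53)
        Sigma_YXs = (Xs_tr[:, f, :] @ Xs_tr[ref_mic, f, :].conj()) / R
        H[f] = np.linalg.solve(Sigma_YY + eps * np.eye(M), Sigma_YXs)
    return H

def apply_filter(H, Y):
    # Y: test mixture STFT (M, F, R); output: single-channel speech estimate (F, R)
    return np.einsum('fm,mfr->fr', H.conj(), Y)
```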
4.5.2 Parametric multichannel Wiener filter (PMWF)
In chapter 5.3.4 it will be pointed out that the speech distortion introduced by the MWF results in a bad performance of the following processing step.
The parametric multichannel Wiener filter (PMWF), as for example presented in [SBA10] and [LN11], connects the multichannel Wiener filter with the MVDR (minimum variance distortionless response) beamformer. It is a well-known approach to control the trade-off between speech distortion and noise reduction. The PMWF can be derived by reformulating the optimization problem in equation 4.50. Using the uncorrelatedness of speech and noise, the cost function can be reformulated as
E\{|X_{S_{TR},ufr} - H_f^H Y_{TR,fr}|^2\} = E\{|(X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}) - H_f^H X_{N_{TR},fr}|^2\}
= E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} + E\{|H_f^H X_{N_{TR},fr}|^2\}   (4.55)
E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} represents the speech distortion and E\{|H_f^H X_{N_{TR},fr}|^2\} the noise reduction. A trade-off can easily be achieved by adding a weighting factor ν to the speech distortion and thus solving the optimization problem
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\} + \nu E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\}   (4.56)
76
If ν is interpreted as a Lagrangian multiplier, the optimization problem is equivalent to
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\} \quad \text{subject to} \quad E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} < \epsilon   (4.57)
where ε is the allowed signal distortion. The PMWF can then be derived as
H_f = (\mu \Sigma_{X_N X_N,fr} + \Sigma_{X_S X_S,fr})^{-1} V_{S_{TR},ufr}\, d_X   (4.58)
with the substitution \mu \overset{\Delta}{=} \frac{1}{\nu}. For μ → 0 the beamformer does not care about the noise reduction and only minimizes the speech distortion. Unfortunately there is a problem if the speech is not created by a point source or if the microphone signals contain some additive noise: in that case, zero distortion can only be achieved by the trivial solution, where the reference microphone receives gain 1 and all other microphones gain 0.
4.5.3 MVDR beamformer
For this reason the MVDR beamformer is derived in its original version [CZO87] by the following optimization problem:
\min_{H_f} E\{|H_f^H Y_{TR,fr}|^2\}   (4.59)
\text{subject to } H_f^H d_f = 1   (4.60)
Equation 4.59 has the well-known solution
H_f = \frac{\Sigma_{YY,fr}^{-1} d_f}{d_f^H \Sigma_{YY,fr}^{-1} d_f}   (4.61)
Compared to the MWF of equation 4.54, the MVDR beamformer is just scaled in such a way that the look direction (represented by d) is not distorted.
If the speech is a point source with propagation vector d, the optimization problem can equivalently be rewritten as
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\}   (4.62)
\text{subject to } H_f^H d_f = 1   (4.63)
Again the solution is
H_f = \frac{\Sigma_{NN,fr}^{-1} d_f}{d_f^H \Sigma_{NN,fr}^{-1} d_f}   (4.64)
It is important to consider that, as the propagation vector is only an average propagation vector of the source, the MVDR beamformer will indeed introduce distortion. A distortionless response is only guaranteed for a point source with the exact propagation vector. It is however likely that the distortions are less severe than for the MWF, without obtaining the trivial solution of the PMWF for μ = 0.
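The following NumPy sketch summarizes how the PMWF of equation 4.58 and the MVDR beamformer of equations 4.61/4.64 can be computed once the covariance matrices and the average propagation vector have been estimated. It is a minimal illustration under the stated assumptions; the function names and the eps regularization are not from the text.

```python
import numpy as np

def mvdr_weights(Sigma, d, eps=1e-10):
    # MVDR beamformer of eq. (4.61)/(4.64): H = Sigma^{-1} d / (d^H Sigma^{-1} d)
    # Sigma: (M, M) covariance (mixture or noise), d: (M,) propagation vector
    Sd = np.linalg.solve(Sigma + eps * np.eye(len(d)), d)
    return Sd / (d.conj() @ Sd)

def pmwf_weights(Sigma_n, Sigma_s, d, V_s, mu=1.0, eps=1e-10):
    # Parametric MWF of eq. (4.58); mu trades noise reduction against speech
    # distortion (mu = 1 gives the plain MWF, mu -> 0 emphasizes low distortion)
    M = len(d)
    return np.linalg.solve(mu * Sigma_n + Sigma_s + eps * np.eye(M), V_s * d)
```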
Chapter 5
Experimental results
The experimental results are summarized in this chapter. First the experimental setup
is explained. Afterwards the reduction performance of the stationary noises achieved
through the prefiltering is discussed. The chapter is then finished with a comparison
of the results of the different movement noise reduction algorithms.
5.1 Experimental setup
If not stated otherwise in the specific sections, the following setup and parameters are
used for the experiments evaluated later on.
The samples are recorded with a sample rate of 48 kHz, but downsampled to 16 kHz
to reduce the computational complexity. The algorithms are performed in the STFT
domain. Each frame is a rectangular window of length 256 and the individual frames
do not overlap.
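For illustration, the non-overlapping rectangular-window STFT used in the experiments can be implemented as in the following minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def stft_rect(x, frame_len=256):
    # Non-overlapping rectangular-window STFT (16 kHz signals, frame length 256)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1).T        # (F, R) = (frequency bins, frames)

def istft_rect(X, frame_len=256):
    # Inverse transform; reconstruction is exact because the frames do not overlap
    return np.fft.irfft(X.T, n=frame_len, axis=1).reshape(-1)
```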
The training and test signals are obtained as follows: All recordings are made in the same room, while NAO was standing or walking on the spot. The following four signals are recorded with the robot NAO in a real room with a moderate reverberation level (T60 ≈ 200 ms):
• A speech signal is played back by a loudspeaker. It contains short sentences of different speakers. It is recorded while the fan was off; thus it contains only the speech and some static noise.
• A fan noise reference is recorded, containing only fan and static noise.
• Walking noise is recorded, where the robot walks on the spot. As the fan has to be turned on, the recording consists of fan, static and walking noise.
• Waving noise is recorded, where the robot waves while standing on the spot. As the fan has to be turned on, the recording consists of fan, static and waving noise.
Either the walking or the waving noise is used for an experiment, never both at the same time. For simplicity, the walking or waving noise recording that is currently used for an experiment is referred to as movement noise.
From the speech and the noise recordings, 20 seconds are extracted starting at 3 seconds to form the testing data. Then a random offset of up to 1000 samples is applied and another 100 seconds are extracted to form the training data. The random offset is applied to avoid getting exactly the same results when running a 'deterministic' algorithm several times; it thus prevents obtaining particularly good or bad results by chance.
5.2 Prefiltering
This section evaluates the performance of the fan spike reduction and the static and
residual fan noise reduction via the prefilters.
5.2.1 Fan noise reduction
First the fan noise reduction algorithm described in chapter 4.2.1 is applied to all signals. The training is performed with the separate fan noise recording and then applied to each microphone of speech and noise, training and test data.
Obviously the algorithm removes energy from all signals. In order to determine whether the algorithm works as planned, the residual to artifact ratio (RAR) as defined by equation 5.1 is used.
\text{RAR} = 10 \log_{10}\left(\frac{V_{out}}{V_{in} - V_{out}}\right) \; \text{[dB]}   (5.1)
V_{in} is the power of the signal before the algorithm and V_{out} the power of the signal after the algorithm. The RAR is used as it can be calculated individually for each of the sources in the mixture (although not all of them can be observed individually). It is thus a measure of how each source is affected by the algorithm. A RAR of ∞ dB corresponds to a signal not changed by the algorithm and a RAR of −∞ dB to complete suppression. The corresponding RAR values for all signals and microphones can be seen in table 5.1. The distortion of speech, static, walking and waving noise is considerably small for all microphones, but the distortion, or in other words the reduction, of the fan noise is quite large. On average, 4.9 dB more fan noise is removed than remains.
Table 5.1: The residual to artifact ratio after fan spike reduction shows a reduction of the fan noise, while the other signals are hardly affected
mic 1 mic 2 mic 3 mic 4 average
RAR speech [dB] 11.5 11.5 11.7 11.8 11.6
RAR static [dB] 11.8 9.8 10.5 11.0 10.8
RAR fan [dB] -3.3 -4.7 -6.4 -5.1 -4.9
RAR walking [dB] 12.9 14.7 9.6 11.9 12.3
RAR waving [dB] 11.2 11.4 13.2 12.3 12.0
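For reference, the RAR of equation 5.1 can be computed as in the following minimal sketch, assuming the individual source component is available before and after the algorithm (as is the case here for the separately recorded signals):

```python
import numpy as np

def rar_db(source_in, source_out):
    # Residual-to-artifact ratio of eq. (5.1) for one source component
    v_in = np.sum(np.abs(source_in) ** 2)
    v_out = np.sum(np.abs(source_out) ** 2)
    return 10 * np.log10(v_out / (v_in - v_out))
```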
5.2.2 Static and residual fan noise reduction
The second filter, the single channel Wiener filter described in chapter 4.2.2, is applied
to each microphone of speech and noise, training and test data as well. As the filter
5.2. PREFILTERING 81
depends on the power of the mixture, its impact on the static and fan noise is calculated
with the average filter of the walking mixture.
Although experiments showed that a slightly better result can be obtained by larger
frames, a rectangular window of size 256 is applied, as this frame-size is used later for
the movement noise reduction, too.
The RAR values in table 5.2 are again calculated in relation to the input before the
fan noise reduction. They contain thus the distortion/reduction introduced by the fan
spike reduction algorithm and the Wiener filter.
Table 5.2: The RAR after Wiener filtering shows a reduction of static and fan noise with little distortion of the other sources
mic 1 mic 2 mic 3 mic 4 average
RAR speech [dB] 8.3 8.4 8.8 8.8 8.5
RAR static [dB] -1.6 -1.7 -1.6 -1.6 -1.6
RAR fan [dB] -6.7 -7.8 -9.4 -8.1 -8.0
RAR walking [dB] 6.4 6.5 5.1 4.0 5.5
RAR waving [dB] 8.2 8.4 9.6 8.4 8.6
With an average of 8.5 dB, the artifacts (distortion) of the speech signal are still acceptable, whereas most of the fan noise is removed. More than half of the static noise is reduced as well.
5.2.3 Prefiltering results
After showing that the introduced speech artifacts are small, it is investigated how the speech-to-interference ratios are affected by the prefilters. An overview is given in table 5.3. As can be seen, microphone 4 has the best speech-to-interference ratio (SIR) for all noise types. The speech-to-static-noise ratio increases by about 3 dB through the Wiener filter and, with values over 20 dB, can be neglected further on. The speech-to-fan-noise ratio increases by about 6 dB through the spike reduction and by an additional 2 dB through the Wiener filtering. With values from 10 to 16 dB it is also negligible compared to the movement noise, which has SIR values below 8 dB, as shown later.
Table 5.3: Impact of the prefilters on the speech to interference ratio (SIR) of static and fan noise
mic 1 mic 2 mic 3 mic 4
static SIR at input [dB] 18.0 18.8 22.4 22.7
static SIR after fan spike reduction [dB] 18.0 18.9 22.5 22.8
static SIR after WF [dB] 21.3 22.1 25.7 26.1
fan SIR at input [dB] 2.3 2.6 3.3 8.3
fan SIR after fan spike reduction [dB] 7.0 8.3 10.3 14.3
fan SIR after WF [dB] 9.2 10.5 12.6 16.5
Average fan and static SIR gain by both prefilters 6.9 7.9 9.3 8.2
5.2.4 Evaluation approach for movement reduction
For the performance evaluation of the movement noise reduction algorithms, the speech and movement estimates obtained through the prefiltering are used as reference. The performance measure of the movement noise reduction is thus not limited by the artifacts of the prefilters. If an algorithm performed a perfect separation of the speech signal, the resulting estimate would not be the true speech signal, but the speech recording distorted by the two prefilters (with the residual static noise). As the RAR in table 5.2 for speech and movement noise is between 5 and 9 dB, this approach can be justified.
The SIR can then also be determined for the input of the movement noise reduction step. It is listed in table 5.4. In the new definition, the movement noise also includes the residual fan noise power, just as the speech signal includes the residual static noise power. In the table, the BSS toolbox version 3.0 is used for the calculation of the performance; it is also used further on for the performance measures of the movement noise reduction.
Table 5.4: SIR and SDR by the BSS toolbox are equivalent after prefiltering
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
walking via BSS -0.6 1.1 4.3 7.4 -0.7 1.0 4.2 7.3
waving via BSS -8.6 -7.3 -6.3 -3.1 -8.7 -7.4 -6.3 -3.1
The BSS toolbox is especially designed for the performance evaluation of source separation algorithms. Roughly explained, the toolbox takes into account that the estimated speech can be expressed as the sum of four components:
\hat{X}_{target} = X_{target} + X_{interferer} + X_{noise} + X_{artifacts}   (5.2)
where X = [X_t]^T represents a vector of time samples. The four terms can be interpreted as follows:
• Xtarget: desired source (speech) in the estimate
• Xinterferer: undesired/interfering sources (movement noise) in the estimate
• Xnoise: residual of the estimate, that is not included in one of the sources
• Xartifacts: represents the deviation of the desired source in the estimate compared
to the true signal of the desired source (comparable to the artifacts used in
equation 5.1)
For a more sophisticated explanation, we refer at this point to the authors' website¹. Furthermore the noise term is assumed to be zero. In this case, two of the BSS parameters are sufficient to describe the separation performance. Hence the source to distortion ratio (SDR) and the source to interference ratio (SIR), defined as follows,
are chosen:
\text{SIR} = 10 \log_{10} \frac{\|X_{target}\|_2^2}{\|X_{interferer}\|_2^2}   (5.3)
\text{SDR} = 10 \log_{10} \frac{\|X_{target}\|_2^2}{\|X_{interferer} + X_{noise} + X_{artifacts}\|_2^2}   (5.4)
¹ http://bass-db.gforge.inria.fr/bss_eval/
The SDR is the measure that is used to determine which of the algorithms performs best. The numerator of the SDR is not changed by the algorithm, whereas the denominator can be reformulated into the MSE of the estimate, \|X_{target} - \hat{X}_{target}\|_2^2, by using equation 5.2.
In addition, the SIR is useful to determine the relation of artifact and interferer power. If SIR ≫ SDR the artifacts are the main problem, if SIR ≈ SDR there are no severe artifacts, and if SIR ≈ SDR + 3 dB artifacts and interference are approximately equally strong.
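A minimal sketch of equations 5.3 and 5.4 is given below. Note that it assumes the four decomposition terms of equation 5.2 are already known; the actual evaluation uses the BSS toolbox, which estimates these terms by projections, a step not reproduced here.

```python
import numpy as np

def sir_sdr(x_target, x_interferer, x_noise=None, x_artifacts=None):
    # Simplified computation of eq. (5.3) and (5.4) from known decomposition terms
    x_noise = np.zeros_like(x_target) if x_noise is None else x_noise
    x_artifacts = np.zeros_like(x_target) if x_artifacts is None else x_artifacts
    num = np.sum(np.abs(x_target) ** 2)
    sir = 10 * np.log10(num / np.sum(np.abs(x_interferer) ** 2))
    sdr = 10 * np.log10(num / np.sum(np.abs(x_interferer + x_noise + x_artifacts) ** 2))
    return sir, sdr
```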
In table 5.4, a first separation estimate, obtained by simply using the prefiltered microphone signal, is already listed. As can be seen, SIR and SDR are identical, as the distortion in the estimate is just the additive interfering movement noise (no artifacts).
5.3 Movement noise reduction
For the movement noise reduction four different multichannel algorithms are applied:
• α-β MU with prefiltering by a MWF, where a speech and a noise codebook are
pretrained
• α-β MU with prefiltering by a MWF, where only a noise codebook is pretrained
• Euclidean MU with prefiltering by a MWF, where a noise codebook is pretrained
such that it predicts the noise power from the observations (see chapter 4.3.5).
• α-β MU for multichannel data, where a speech and a noise codebook are pre-
trained
Each algorithm is tested for different codebook sizes K and if applicable for the fol-
lowing parameters α and β:
Table 5.5: Selected parameter for the generalized α-β divergence
α β
1.0 1.0 squared Euclidean distance
1.0 0.0 Kullback-Leibler divergence
1.0 -1.0 Itakura-Saito divergence
0.5 1.0
2.0 1.0
1.0 -2.0
1.0 2.0
The complete results can be seen in appendices A and B. In the following sections, the best performance of each algorithm (regarding codebook size K and parameters α, β) is discussed.
5.3.1 Experimental results for single channel data
Before proceeding to the multichannel results, this chapter shows that the three single channel methods indeed achieve a performance increase for single channel data.
As shown in table 5.6, the best SDR is achieved by training a speech and a noise codebook. An SDR increase of 5.6 dB for walking and 10.3 dB for waving can be achieved. With 1 and 9, the codebook sizes are small. The predictive algorithm performs second best and NMF with only the noise codebook performs worst. If selected by best SIR, the order is the same (see table 5.7).
5.3.2 Performance of PMWF for spatial preprocessing
As described in chapter 4.5 a beamformer can be used as additional prefilter for solving
the multichannel problem with less computational complexity. The algorithm proposed
Table 5.6: Single channel overview; parameters selected by best SDR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
without an algorithm - - -0.6 -0.7 - - -8.6 -8.7
α-β MU, speech + noise 1;0 1 8.1 4.9 0.5;1 9 3.4 1.6
α-β MU, noise 1;1 6 5 1.9 1;-1 129 -5.4 -0.5
Euclidean predictive, noise - 90 5.9 2.7 - 129 -0.5 -0.9
Table 5.7: Single channel overview; parameters selected by best SIR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
without an algorithm - - -0.6 -0.7 - - -8.6 -8.7
α-β MU, speech + noise 1;2 4 12.9 1.9 0.5;1 9 3.4 1.6
α-β MU, noise 2,1 27 6.6 1.5 1;0 27 -2.6 -1.1
Euclidean predictive, noise - 80 5.9 2.7 - 22 0 -1
earlier is the parametric multichannel Wiener filter (PMWF). First the parameter µ that controls the trade-off between noise reduction and speech artifacts has to be determined. As can be seen in figure 5.1 (for completeness, the corresponding figure for the waving noise is shown in appendix B.2.1), the SDR is best for µ = 1, where the PMWF simply becomes the multichannel Wiener filter (MWF). This result is not surprising, as it is well known that the MWF is the optimum linear estimator regarding the MSE. For this reason the MWF is selected as prefilter. The achieved performance and the corresponding improvement are listed in table 5.8.
Furthermore, figure 5.1 shows that for µ → 0, SIR and SDR converge towards the input performance of the microphone that was used as reference. The estimated filter is thus the trivial solution, where one microphone is completely preserved and all others are attenuated to zero. Choosing µ = 0 would thus result in the performance shown in the single channel result section. Though, it has to be mentioned that the single channel evaluation was, as explained earlier, performed with the first microphone, which does not have the best input SDR.
Figure 5.1: SIR, SDR and SAR of the parametric Wiener filter for the walking noise. [Plots over 10 log10(µ) from −60 to 50 dB; not reproduced.]
Table 5.8: BSS performance after MWF (µ = 1)
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
waving after MWF 15.0 14.5 15.2 14.6 6.2 6.4 7.2 7.3
waving improvement 23.6 21.8 21.5 17.7 14.8 13.8 13.6 10.4
walking after MWF 19.5 18.5 20.6 18.5 10.7 11.7 13.3 13.5
walking improvement 20.0 17.4 16.3 11.1 11.4 10.7 9.1 6.1
Table 5.8 shows that microphone 3 is best for both noise types regarding SIR, whereas microphone 4 is best regarding SDR. As microphone 1 was used as reference for most of the experiments, it is also picked as reference here, although the MWF designed for microphone 4 performs better.
5.3.3 Experimental results for multichannel data
As stated earlier, the main criterion for selecting the best algorithm is the SDR. The four multichannel algorithms are listed in table 5.9 with their optimum parameters. As indicated by the bold letters, the MWF results in the best SDR. The three NMF methods applied after the MWF thus always lead to a loss in SDR. Furthermore, the true multichannel algorithm has a far worse SDR than the beamformer, with a loss of 9.7 dB for walking and 5.1 dB for waving. For reference, the table also shows the single channel results and the performance of PO-KSVD+ from [DK14] on the same scenario. As can be seen, the multichannel algorithm performs even worse than the single channel one. The single channel method even achieves a better SDR for the walking noise compared to PO-KSVD+, but performs worse for the waving noise.
Table 5.9: Multichannel overview best SDR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
best input microphone - - 7.4 7.3 - - -3.1 -3.1
MWF - - 19.8 10.7 - - 15.7 6.2
α-β MU, prefilter, speech + noise 1; -1 1 25.3 7.9 1; 0 1 22.2 5.7
α-β MU, prefilter, noise 1; 2 1 19.4 2.4 1; 2 1 13.2 2.6
Euclidean pred., prefilter, noise 50 21.6 4.9 80 18.7 3.3
α-β MU, multich., speech + noise 1; 2 24 19.8 1 2;1 18 24.1 1.1
α-β MU, single ch., speech + noise 1;0 1 8.1 4.9 0.5;1 9 3.4 1.6
PO-KSVD+ from [DK14] 22.3 1.8 22.5 2.3
The results are different if one only cares about the SIR. The corresponding performances are listed in table 5.10.
The α-β MU with prefilter and a speech and noise codebook with small codebook size (K = 4) achieves a quite impressive SIR increase of 5.2 dB compared to the MWF alone, but the SDR decreases by 7.7 dB for the walking noise.
Interestingly the multichannel algorithm achieves best SIR for the waving noise. The
SIR increases by 11.6 dB, whereas the SDR drops by 5.1 dB.
Table 5.10: Multichannel overview best SIR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
α-β MU, prefilter, speech + noise 1; 1 4 26 3 1,-1 1 25.4 5
α-β MU, prefilter, noise 1; -1 1 22 2.3 1;-1 129 20 1.6
Euclidean pred., prefilter, noise - 50 21.6 4.9 - 129 18.7 3.3
α-β MU, multich., speech + noise 1; 0 26 23.4 0 0.5; 1 4 27.3 1.1
Several conclusions can be drawn from the data of the last two chapters:
• The three investigated single channel algorithms improve the SDR of single channel data, yet they cannot do so after the MWF as spatial prefilter.
• The multichannel NMF algorithm can achieve some improvement, yet its SDR is worse compared to the equivalent single channel algorithm.
• All multichannel algorithms result in a quite impressive SIR.
In order to explain why the NMF algorithms after an MWF result in an increase in SIR but a decrease in SDR, the behavior of desired source, interferer and artifacts is investigated before, between and after the two stages.
5.3.4 Discussion on using NMF with MWF as spatial prefilter
At the input of the MWF, no artifacts are assumed, such that SIR and SDR are equivalently given as
\text{SIR}_{in} = \text{SDR}_{in} = 10 \log_{10} \frac{\|X_{target,in}\|_2^2}{\|X_{interferer,in}\|_2^2}   (5.5)
The first stage, represented by MWF, reduces the interferer error term as X_{interferer,in} \xrightarrow{MWF} X_{interferer,MWF}. By doing so, the target source is also distorted; it is reduced according to X_{target,in} \xrightarrow{MWF} X_{target,MWF}. On the other hand, the modification of the target by the stage can be interpreted as adding some artifacts to the signal, where the artifact error term is X_{artifact,MWF} = X_{target,in} - X_{target,MWF}.
The SDR at the output of the first stage is then given as
\text{SDR}_{MWF} = 10 \log_{10} \frac{\|X_{target,in}\|_2^2}{\|X_{interferer,MWF} + X_{artifact,MWF}\|_2^2}   (5.6)
From equation 5.5 to 5.6, only the denominator changes. The best estimator regarding the SDR is thus the one that finds the best trade-off between average interferer and artifact power.
As mentioned in chapter 5.2.4, the denominator can also be seen as the MSE of an estimate. Minimizing the MSE is hence equivalent to maximizing the SDR. As pointed out earlier, the MWF uses the MSE cost function. The residual cost of the estimator is given by
C_{MSE,MWF} = \|X_{interferer,MWF} + X_{artifact,MWF}\|^2   (5.7)
The second stage, abbreviated by NMF, further reduces the interferer as X_{interferer,MWF} \xrightarrow{NMF} X_{interferer,NMF} while introducing the artifacts X_{artifact,NMF} = X_{target,MWF} - X_{target,NMF}. The MSE cost of the second stage is similar to equation 5.7:
C_{MSE,NMF} = \|X_{interferer,NMF} + X_{artifact,NMF}\|^2   (5.8)
In contrast to the MWF, the generalized α-β divergence MU rules do not minimize the MSE for all parameters; thus they do not maximize the SDR. But even if they did minimize the MSE, this would no longer be equivalent to maximizing the overall SDR, given by
\text{SDR}_{MWF+NMF} = 10 \log_{10} \frac{\|X_{target,in}\|^2}{\|X_{interferer,NMF} + X_{artifact,MWF} + X_{artifact,NMF}\|^2}   (5.9)
The second stage should consider the artifacts introduced in the first stage, yet it is independent of those artifacts. As a result, the SDR drops and the SIR, which is independent of the artifacts, increases. Another interpretation is that the second stage can only reduce the artifacts of the first stage by chance, as it does not know them. At this point this mismatch can be tackled by two different approaches:
• The second stage has to estimate the speech input of the first stage rather than
the speech output of the first stage.
• The first stage is not allowed to add artifacts.
The first idea is difficult to transfer to the NMF algorithms used as second stage. It may be possible to adapt the predictive Euclidean matrix factorization to account for the first stage. As a drawback, the Wiener filter can then no longer be restricted to values between 0 and 1, as the second stage has to be able to equalize the attenuation (and thus the artifacts) of the first stage. The only possibility left is to keep the artifacts of the first stage at a minimum. As shown earlier, the PMWF could only do so by the trivial solution of using the reference microphone only. In the next chapter the MVDR beamformer in its original form is therefore tested.
5.3.5 Performance of MVDR beamformer for spatial preprocessing
As explained in chapter 4.5.3, the MVDR beamformer can be calculated in two versions, which are equivalent for a non-moving point source. The achieved performance for both versions is shown in table 5.11 for the waving noise. The SDR can be slightly increased by both methods. However, compared to the MWF, the resulting SDR is significantly worse.
The table shows that SIR ≫ SDR regardless of which microphone is used as reference or which method is used. The artifacts are thus more troublesome than the residual interference. In contrast to the theoretical background, the MVDR beamformer introduces severe artifacts; the objective of no speech distortion is not met.
Table 5.11: BSS performance after MVDR
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
waving MVDR, ΣNN 12.7 12.6 11.5 10.1 -7.6 -6.5 -1.8 -2.2
waving improvement 21.3 19.9 17.8 13.2 1.1 0.9 4.5 0.9
waving MVDR, ΣY Y 11.6 11.3 11.3 9.8 -7.5 -6.4 -1.8 -2.2
waving improvement 20.2 18.7 17.6 12.9 1.2 1.0 4.5 0.9
The only reasonable explanation is that the used relative propagation function is not at all constant over time. This can also be observed in the actual data. The average propagation vector is thus not well suited for most of the frames. In all those frames the speech signal gets distorted.
The reason for this behavior is however not clear. If the speech source moved around or were not point-like, a changing propagation vector over frames would make sense. Yet the loudspeaker is static and fairly point-like, and the robot was standing still as well.
At this point it can only be guessed that the frame length of 256 samples was not well suited. We conclude this chapter by pointing out that the investigated spatial preprocessing methods introduce too many speech artifacts for a postprocessing stage to make sense.
Chapter 6
Conclusion
The experimental results of this work show that noise reduction by a Wiener filter with variance parameters estimated by NMF is possible. With the generalized α-β divergence, an SDR increase of 5 to 10 dB, depending on the input SDR, is possible. The results strongly depend on the used parameters, the number of codewords and the training technique. The experiments clearly show that the best parameters depend on the scenario.
Using a speech and a noise codebook generally proves to be superior to the case where only a noise codebook is used. A novel matrix factorization approach, designed to account for the pretraining of the codebook with a reference signal, is derived for the Euclidean cost function. It performs better than the Euclidean NMF with a single codebook for both tested movement noises, yet fails to outperform Euclidean NMF with a speech and a noise codebook. The new estimator is indeed linear and of rank K. It highlights the fact that the way the standard NMF procedure uses the observations is unknown due to the iterative optimization. Also, the new approach does not feature the performance loss for large codebooks that is typical for NMF. This property confirms the theoretical investigation of NMF with pretrained dictionaries. We have argued that the discussed training approach is based on several assumptions which are not fulfilled in most scenarios and which can lead to a severe loss in performance.
The experimental results with the multichannel data show that NMF is, in the way it was used in this work, not suited for multichannel ego-noise reduction. The multichannel MU rules for the generalized α-β divergence perform worse regarding SDR than when applied to a single channel.
The novel two-stage approach with a beamformer as spatial prefilter does not work in combination with any of the applied algorithms. Although the spatial filter achieves a significant increase in SDR, supported by the time-invariant spatial properties, the subsequent NMF algorithm actually leads to a decrease in SDR. The evaluation shows that the NMF algorithm does not take into account the artifacts created by the MWF, which are more severe than the residual interferer power. The MVDR beamformer is tried as an alternative spatial prefilter. The use of an average relative transfer function turns out to be unsuitable for keeping the speech artifacts at a tolerable level.
Future work
This work develops several ideas which can be investigated in future work.
The concept of multichannel prefiltering can be investigated further with the objective of finding a beamformer with few artifacts on the speech signal. In this context, the evaluation of the SDR as the decisive performance criterion has to be questioned and should be compared to the keyword recognition rate of an actual recognition engine.
The results of the fixed MWF suggest that if the spatial covariance matrix of the ego-noise can be estimated accurately enough, a very good noise reduction performance can be achieved. At this point, the incorporation of the motor data seems very promising.
The proposed interpretation of NMF models via an activation function and a mixing function is another starting point for further research. In particular, the property of the mixing of being non-deterministic regarding the observations of the individual sources requires the development of mixing models for the observations via probability distributions.
The theoretical work on the pretraining of the dictionary results in the claim that the NMF algorithm can produce an arbitrary distribution of the power over the activation functions. Both experimental and theoretical investigations of this behavior could lead to a better understanding of source separation using NMF. In the course of that, constrained NMF gains new importance.
Finally, the development of better NMF training approaches is desirable. The importance of new training approaches is underlined by recent developments in [SLOVB14] and [SLVB15] on their so-called deformed references. As an outlook on future NMF techniques, a review of their ideas concludes this thesis.
Instead of two individual training and testing stages, the authors propose to share parameters (for example the dictionaries) between training and testing. Parameters are hereby coupled by transition probabilities. The optimization problem is then given by jointly minimizing the training and the testing cost functions. Transferring this idea to the approach of this work, it should be possible to train a speech and a noise dictionary such that the speech dictionary represents the speech training observations, the noise dictionary represents the noise training observations, and together they represent the training mixture observations. It is likely that this approach is not the perfect solution, but it is still better suited regarding the approximate additivity assumption.
Appendix A
Single channel results
A.1 walking
A.1.1 α-β-divergence with speech and noise codebook
Figure A.1: Walking-noise reduction by the α-β-divergence and speech and noise codebook (single channel). [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table A.1: Walking-noise reduction by the α-β-divergence and speech and noise codebook (single channel)
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 2 3 2 1 1 1 4
SIR [dB] 12.7 8.9 6.3 11.5 11.8 3.2 12.9
SDR [dB] 2.2 4.0 3.1 3.4 3.4 1.5 1.9
Selected after best SDR
K 10 1 7 16 6 1 80
SIR [dB] 8.1 8.1 5.9 7.6 8.2 3.2 8.1
SDR [dB] 3.9 4.9 3.9 4.4 4.3 1.5 3.4
A.1.2 MF with the α-β-divergence and Euclidean predictive
by a noise codebook
Figure A.2: Walking-noise reduction by the α-β-divergence with noise codebook. [Plot of SIR and SDR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Figure A.3: Walking-noise reduction by Euclidean predictive MF (single channel). [Plot of SDR and SIR over codebook size K; not reproduced.]
Table A.2: Walking-noise reduction by the α-β-divergence with noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1; 2 EUC pred.
Selected after best SIR
K 15 10 4 14 27 1 16 80
SIR [dB] 5.3 4.9 3.8 4.3 6.6 5.6 5.0 5.9
SDR [dB] 1.7 1.5 1.5 1.7 1.5 1.7 1.6 2.7
Selected after best SDR
K 6 10 60 5 4 1 11 90
SIR [dB] 5.0 4.9 3.2 4.1 5.6 5.6 4.7 5.9
SDR [dB] 1.9 1.5 1.6 1.8 1.6 1.7 1.6 2.7
A.2 waving
A.2.1 α-β-divergence speech and noise codebook
Figure A.4: Waving-noise reduction by the α-β-divergence and speech and noise codebook (single channel). [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table A.3: Waving-noise reduction by the α-β-divergence and speech and noise codebook (single channel)
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 3 2 1 9 3 29 9
SIR [dB] 1.4 2.4 3.4 3.4 -2.7 -4.1 1.4
SDR [dB] 0.2 -0.6 0.3 1.6 -3.5 -0.8 0.4
Selected after best SDR
K 3 3 1 9 3 29 9
SIR [dB] 1.4 2.0 3.4 3.4 -2.7 -4.1 1.4
SDR [dB] 0.2 -0.3 0.3 1.6 -3.5 -0.8 0.4
A.2.2 MF with the α-β-divergence and Euclidean predictive
by a noise codebook
Figure A.5: Waving-noise reduction by the α-β-divergence with noise codebook. [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Figure A.6: Waving-noise reduction by Euclidean predictive MF (single channel). [Plot of SDR and SIR over codebook size K; not reproduced.]
Table A.4: Waving-noise reduction by the α-β-divergence and Euclidean predictive with noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1 1; 2 EUC pred.
Selected after best SIR
K 80 27 60 90 129 1 100 22
SIR [dB] -3.4 -2.6 -5.2 -3.9 -3.4 5.2 -3.9 0.0
SDR [dB] -2.3 -1.1 -1.2 -2.6 -2.3 0.0 -3.1 -1.0
Selected after best SDR
K 80 90 129 129 70 1 100 129
SIR [dB] -3.4 -3.9 -5.4 -3.9 -3.5 5.2 -3.9 -0.5
SDR [dB] -2.3 -0.7 -0.5 -2.5 -2.3 0.0 -3.1 -0.9
¹ This parameter set is excluded while searching for the best parameters due to its strange behavior over K.
Appendix B
Multichannel results
B.1 walking
B.1.1 MF with the α-β-divergence, MWF as prefilter and
speech and noise codebook
Figure B.1: Walking-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook. [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table B.1: Walking-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 4 1 1 2 5 0 7
SIR [dB] 26.0 24.2 25.3 25.9 24.4 19.8 23.6
SDR [dB] 3.0 6.6 7.9 3.0 3.0 10.7 1.0
Selected after best SDR
K 0 0 0 0 0 0 0
SIR [dB] 19.8 19.8 19.8 19.8 19.8 19.8 19.8
SDR [dB] 10.7 10.7 10.7 10.7 10.7 10.7 10.7
B.1.2 MF with α-β-divergence and Euclidean predictive, MWF
as prefilter and noise codebook
Figure B.2: Walking-noise reduction by the α-β-divergence with MWF prefilter and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Figure B.3: Walking-noise reduction by Euclidean predictive MF. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.2: Walking-noise reduction by the α-β-divergence and Euclidean predictive, MWF as prefilter and noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1; 2 EUC pred.
Selected after best SIR
K 5 1 1 2 1 0 0 50
SIR [dB] 20.6 21.6 22.0 19.8 20.4 19.7 19.7 21.6
SDR [dB] 2.1 2.1 2.3 2.5 1.8 10.7 10.7 4.9
Selected after best SDR
K 0 0 0 0 0 0 0 0
SIR [dB] 19.7 19.7 19.7 19.7 19.7 19.7 19.7 19.8
SDR [dB] 10.7 10.7 10.7 10.7 10.7 10.7 10.7 10.7
B.1.3 Multichannel matrix factorization with the α-β-divergence
Figure B.4: Walking-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.3: Walking-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 30 26 6 6 15 1 10
SIR [dB] 21.2 23.4 19.0 21.1 20.9 4.4 21.8
SDR [dB] 0.0 0.0 0.4 0.3 0.5 0.7 0.6
Selected after best SDR
K 5 3 3 1 9 1 24
SIR [dB] 20.6 18.5 18.9 19.5 20.4 4.4 19.8
SDR [dB] 0.7 0.1 0.8 0.5 0.7 1.0
B.2 waving
B.2.1 PMWF as prefilter
Figure B.5: SIR, SDR and SAR of the parametric Wiener filter for the waving noise. [Plots over 10 log10(µ) from −60 to 50 dB; not reproduced.]
B.2.2 MF with the α-β-divergence, MWF as prefilter and
speech and noise codebook
Figure B.6: Waving-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook. [Plot of SIR and SDR over codebook size K; not reproduced.]
Table B.4: Waving-noise reduction by the α-β-divergence with MWF prefilter
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 5 1 1 2 14 24 22
SIR [dB] 22.9 22.2 25.4 22.2 21.0 17.9 22.2
SDR [dB] 2.5 5.7 5.0 4.6 2.5 0.7 1.6
Selected after best SDR
K 0 0 0 0 0 0 0
SIR [dB] 15.7 15.7 15.7 15.7 15.7 15.7 15.7
SDR [dB] 6.2 6.2 6.2 6.2 6.2 6.2 6.2
B.2.3 MF with the α-β-divergence, MWF as prefilter and
noise codebook
Figure B.7: Waving-noise reduction by the α-β-divergence with MWF prefilter and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Figure B.8: Waving-noise reduction by Euclidean predictive MF. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.5: Waving-noise reduction by the α-β-divergence and Euclidean predictive with MWF prefilter and noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1 1; 2 EUC pred.
Selected after best SIR
K 3 1 129 4 8 1 0 129
SIR [dB] 16.1 17.6 20.0 16.0 17.0 23.7 15.6 18.7
SDR [dB] 1.5 1.6 1.6 1.6 0.9 0.0 6.2 3.3
Selected after best SDR
K 0 0 0 0 0 0 0 0
SIR [dB] 15.6 15.6 15.6 15.6 15.6 15.6 15.6 15.7
SDR [dB] 6.2 6.2 6.2 6.2 6.2 6.2 6.2 6.2
B.2.4 Multichannel matrix factorization with the α-β-divergence
Figure B.9: Waving-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
¹ This parameter set is excluded while searching for the best parameters due to its strange behavior over K.
Table B.6: Waving-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 9 6 4 4 65 3 65
SIR [dB] 26.1 26.3 27.0 27.3 26.1 -3.4 24.9
SDR [dB] 0.4 0.2 0.6 0.3 0.6 -3.2 0.7
Selected after best SDR
K 16 2 2 2 18 40 26
SIR [dB] 25.2 25.1 25.2 26.4 24.1 -12.1 21.1
SDR [dB] 0.5 0.5 0.6 0.5 1.1 -1.0 1.0
Appendix C
Overview on related algorithms
C.0.5 Minimum Mean Square Error (MMSE) estimation
As can for example be seen in [OV10], pages 140-160, the mean square error (MSE) can be formulated as follows:
E\{|X - \hat{x}(y)|^2 \mid Y = y\} = \int |x - \hat{x}(y)|^2 f_{X|Y}(x|y)\, dx   (C.1)
with the conditional or a posteriori density f_{X|Y}(x|y).
The estimate is obtained by taking the derivative with respect to \hat{x}:
\frac{\partial}{\partial \hat{x}} \int |x - \hat{x}(y)|^2 f_{X|Y}(x|y)\, dx = -2 \int (x - \hat{x}(y)) f_{X|Y}(x|y)\, dx \overset{!}{=} 0   (C.2)
and therefore
\int \hat{x}(y) f_{X|Y}(x|y)\, dx = \hat{x}(y) = \int x\, f_{X|Y}(x|y)\, dx   (C.3)
Finally, by rewriting equation C.3, the estimate is given by the conditional expectation
\hat{x}(y) = E\{X|Y = y\}   (C.4)
In a similar manner, the MMSE estimate for L observations Y_1 = y_1, \dots, Y_L = y_L can be shown to be
\hat{x}(y_1, \dots, y_L) = E\{X|Y_1 = y_1, \dots, Y_L = y_L\}   (C.5)
By grouping the measured random variables in the vector Y and the measurements in y, the MMSE estimate writes as
\hat{x}(\mathbf{y}) = E\{X|\mathbf{Y} = \mathbf{y}\}   (C.6)
For the calculation, the conditional density f_{X|Y}(x|\mathbf{Y} = \mathbf{y}) has to be known.
Similar to calculating an estimator for the known observation y, an estimator can be formulated for the random variables Y:
\hat{X} = \hat{x}(\mathbf{Y}) = E\{X|\mathbf{Y}\}   (C.7)
The estimator \hat{x} hence has to minimize the mean square error for each y out of Y, or written as an integral,
\int E\{|X - \hat{x}(\mathbf{y})|^2 \mid \mathbf{Y} = \mathbf{y}\}\, f_Y(\mathbf{y})\, d\mathbf{y}   (C.8)
This is however given, as \hat{x}(\mathbf{y}) minimizes the inner expectation and f_Y(\mathbf{y}) is nonnegative.
Linear Minimum Mean Square Error (LMMSE) estimation for jointly Gaussian observations
As shown in the previous chapter, the MMSE estimate for the observation vector Y is
\hat{X}(\mathbf{Y}) = E\{X|\mathbf{Y}\}   (C.9)
If the observations Y and X are jointly Gaussian and of zero mean, it can be shown that the optimum MMSE estimator is linear. Assuming that
\hat{X}(\mathbf{Y}) = E\{X|\mathbf{Y}\} = \mathbf{h}_{LMMSE}^H \mathbf{Y}   (C.10)
and introducing the error of the estimate E = X - \hat{X} = X - \mathbf{h}_{LMMSE}^H \mathbf{Y}, equation C.9 rewrites as
\hat{X}(\mathbf{Y}) = E\{E + \hat{X}|\mathbf{Y}\} = E\{E|\mathbf{Y}\} + E\{\hat{X}|\mathbf{Y}\}   (C.11)
Furthermore, E and Y are
• orthogonal (see [OV10]),
• of zero mean, as E is a linear combination of the zero-mean variables X and Y,
• hence uncorrelated,
• jointly Gaussian, as E is a linear combination of the jointly Gaussian Y,
• statistically independent, as they are uncorrelated, of zero mean and jointly Gaussian.
Due to the statistical independence of E and Y,
E\{E|\mathbf{Y}\} = 0   (C.12)
and finally
\hat{X}(\mathbf{Y}) = E\{\hat{X}|\mathbf{Y}\}   (C.13)
which is obviously fulfilled for the linear estimator. A linear MMSE filter is thus an optimum filter regarding the MMSE for jointly Gaussian observations.
The optimum LMMSE coefficients can be calculated by
\mathbf{h}_{LMMSE} = \operatorname{argmin}_{\mathbf{h}} E\{|X - \mathbf{h}^H \mathbf{Y}|^2\}   (C.14)
with
\frac{\partial}{\partial \mathbf{h}^H} E\{(X - \mathbf{h}^H \mathbf{Y})(X^* - \mathbf{Y}^H \mathbf{h})\} = -E\{\mathbf{Y}(X^* - \mathbf{Y}^H \mathbf{h})\} \overset{!}{=} 0   (C.15)
In the end, the LMMSE coefficients are given by
\mathbf{h} = E\{\mathbf{Y}\mathbf{Y}^H\}^{-1} E\{\mathbf{Y} X^*\} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.16)
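As a small numeric illustration of equation C.16 (an assumed toy example, not taken from the text), the LMMSE coefficients can be estimated from sample moments:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)                        # hidden zero-mean signal
Y = np.vstack([x + 0.5 * rng.standard_normal(10000)   # two noisy observations of x
               for _ in range(2)])
Sigma_YY = Y @ Y.conj().T / Y.shape[1]                 # sample E{Y Y^H}
Sigma_YX = Y @ x.conj() / Y.shape[1]                   # sample E{Y X*}
h = np.linalg.solve(Sigma_YY, Sigma_YX)                # eq. (C.16)
x_hat = h.conj() @ Y                                   # LMMSE estimate of x
```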
C.0.6 Wiener filtering
For Gaussian random processes, the LMMSE estimator leads to the well-known Wiener filter. If the observations contain all previous samples, y(t) = [y(-\infty), \dots, y(t-1), y(t)]^T, i.e. the signal is semi-infinite, the filter is referred to as the causal Wiener filter ([Li06]) and the MMSE estimate is calculated as
\hat{x}(t) = \mathbf{h}^H \mathbf{y}(t) \quad \text{with} \quad \mathbf{h} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.17)
If the observations are given by a signal that is infinite on both sides, y(t) = [y(-\infty), \dots, y(t), \dots, y(\infty)]^T, the LMMSE estimator is referred to as the non-causal Wiener filter ([Li06]). The complete signal can be estimated by
\hat{x}(t) = \mathbf{h}^H \mathbf{y}(t) \quad \text{with} \quad \mathbf{h} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.18)
It can also be expressed in the Fourier domain as an element-wise multiplication
\hat{x}(f) = H(f)\, y(f) \quad \text{with} \quad H(f) = \frac{S_{YX}(f)}{S_{YY}(f)}   (C.19)
where S_{YX}(f) and S_{YY}(f) denote the cross- and power spectral density. Note that each frequency is handled separately; the frequencies therefore have to be statistically independent. Furthermore, the estimator uses the noisy phase.
The original Wiener theory cannot be used in practice, as its filters and the moments the filters are composed of are of infinite length. For the causal Wiener filter it is sufficient to assume that the estimate \hat{x}(t) only depends on the last N observations.
The frequency domain Wiener filter is usually approximated by using a DFT of length N. At this point it may be noted that the frequencies of the DFT of length N may no longer be statistically independent (e.g. they may be correlated). The approximation via the DFT of length N may therefore no longer be an optimum MMSE estimator given the N observations. This was also pointed out by [LA04].
Furthermore, to apply the filter it is implied that the second order moments are known for each time instance. Strictly speaking, the optimum filter for the observations y(t) is h(t). The way to estimate the moments strongly depends on the given scenario and is therefore not discussed at this point.
C.0.7 Nonnegative Graph Embedding
In [YYF+08] the authors propose a general nonnegative data decomposition formulation that combines NMF with the graph embedding framework (see [YXZ+07], later rewritten in [YXZZ05]). Let
G = \{\mathbf{X}, \mathbf{S}\}   (C.20)
denote the graph G with the vertex set X and the similarity measure S. Furthermore, the diagonal matrix S_D is defined as
S_{D,ii} = \sum_{j \neq i} S_{ij} \quad \forall i   (C.21)
and the Laplacian L as
\mathbf{L} = \mathbf{S}_D - \mathbf{S}   (C.22)
Also let the penalty graph be defined as G^P = \{\mathbf{X}^P, \mathbf{S}^P\}. Let \mathbf{A}^1 denote the desired low-dimensional representation of the training data; the graph preserving problem can then be formulated as follows:
\max_{\mathbf{A}^1} \sum_{r_1 \neq r_2} \|\mathbf{A}^1_{r_1} - \mathbf{A}^1_{r_2}\|^2 S^P_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^1 \mathbf{L}^P \mathbf{A}^{1T})   (C.23)
\min_{\mathbf{A}^1} \sum_{r_1 \neq r_2} \|\mathbf{A}^1_{r_1} - \mathbf{A}^1_{r_2}\|^2 S_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^1 \mathbf{L} \mathbf{A}^{1T})   (C.25)
To combine the graph embedding with the NMF cost function, the authors divide the activation matrix and the dictionary into
\mathbf{A} = \begin{bmatrix} \mathbf{A}^1 \\ \mathbf{A}^2 \end{bmatrix}, \quad \mathbf{D} = [\mathbf{D}^1\; \mathbf{D}^2]   (C.26)
\mathbf{A}^2, \mathbf{D}^2 represent the complementary space of \mathbf{A}^1, \mathbf{D}^1, such that their addition results in the original data. The objective function in equation C.23 is equivalent to
\min_{\mathbf{A}^2} \sum_{r_1 \neq r_2} \|\mathbf{A}^2_{r_1} - \mathbf{A}^2_{r_2}\|^2 S^P_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^2 \mathbf{L}^P \mathbf{A}^{2T})   (C.28)
The (Euclidean) non-negative graph embedding problem is then formulated as
\min_{\mathbf{A},\mathbf{D}} \operatorname{tr}(\mathbf{A}^2 \mathbf{L}^P \mathbf{A}^{2T}) + \operatorname{tr}(\mathbf{A}^1 \mathbf{L} \mathbf{A}^{1T}) + \lambda \|\mathbf{X} - \mathbf{A}\mathbf{D}\|_F^2   (C.29)
\text{subject to: } \mathbf{A}, \mathbf{D} \geq 0   (C.30)
Furthermore, the authors prove that there exists an iterative optimization procedure.
C.1 Terminology of non-blind source separation algorithms
In [VJA+10] and [LGS+12] the terms informed source separation and blind source separation are used. Later, [LDDR13] further divided informed audio source separation into model-based and signal-based, depending on the available side information. Signal-based side information is then further divided into score-informed and exemplar-based source separation. Score-informed data would for example be information over time about whether a source is on or off. Exemplar-based source separation instead is based on additional actual signals, such as cover tracks for music.
In [VBGB14] the authors suggest that the term informed source separation is used rather inconsistently. They therefore use strongly guided and weakly guided, depending on the type of side information.
In papers more related to machine learning, the terms supervised and semi-supervised are often used [YYF+08] [CHH07] [Zhu05]. Semi-supervised learning techniques are especially promising, as they can be initially trained with the labeled data and can increase their performance with the unlabeled data later on. In chapter 6 an NMF training technique is referred to that exploits labeled and unlabeled data.
In our scenario all training data is labeled; in other words, there exists training data for the speech and training data for the ego-noise of the robot.
Appendix D
Derivations
D.1 Parameter estimation by matrix factorization
In chapter 2.5.1, NMF with the Itakura-Saito divergence was introduced as the ML
estimate for the sum of Gaussian components. However, the presented model is not the
only statistical model where the estimation of parameters is equivalent to solving a
matrix factorization problem. Furthermore, it has to be noted that some source models
developed for audio source separation in the literature are not equivalent to an MF
problem. In [Att03], for example, Gaussian mixture models (GMMs) are used.
D.1.1 Euclidean distance as ML estimation of additive Gaussian noise
In [Vir07] and [FC09] Euclidean distance is named as the ML estimator in the presence
of additive Gaussian noise. The difference to chapter 2.5.1 is that the mean of the
processes is modeled by the NMF parameters, whereas the variance is identical for all
components.
Y_{fr} = \sum_{k=1}^{K} C_{k,fr} \quad \text{with} \quad C_{k,fr} \sim \mathcal{N}(D_{fk} A_{kr}, \sigma^2)        (D.1)
The negative log-likelihood thus writes as

C_{ML}(D, A) = -\sum_{fr} \log p(Y_{fr} \,|\, D_f, A_r)
             = \sum_{fr} \log(\pi) + \log(\sigma^2/K) + \frac{(Y_{fr} - D_f^T A_r)^2}{\sigma^2}
             \overset{c}{=} \frac{1}{\sigma^2} \sum_{fr} (Y_{fr} - D_f^T A_r)^2 = \frac{1}{\sigma^2} D_{EUC}(Y, DA)        (D.2)
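As a minimal sketch of the corresponding estimator, the standard multiplicative update rules for the Euclidean cost [LS00] can be written in a few lines of NumPy; this is an illustration only, not the algorithm evaluated in this thesis.

    import numpy as np

    def nmf_euclidean(Y, K, n_iter=200, eps=1e-12):
        # Minimize ||Y - D A||_F^2 with the multiplicative updates of Lee & Seung.
        F, R = Y.shape
        rng = np.random.default_rng(0)
        D = rng.random((F, K)) + eps
        A = rng.random((K, R)) + eps
        for _ in range(n_iter):
            A *= (D.T @ Y) / (D.T @ D @ A + eps)
            D *= (Y @ A.T) / (D @ A @ A.T + eps)
        return D, A

    # toy usage on a random nonnegative "spectrogram"
    Y = np.abs(np.random.randn(64, 100))
    D, A = nmf_euclidean(Y, K=5)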
D.1.2 ML estimation of the sum of Poisson distributed magnitude components
[FBD09] pointed out that the generative model of equation 2.13 can be modified into

|Y_{fr}| = \sum_{k=1}^{K} |C_{k,fr}| \quad \text{with} \quad |C_{k,fr}| \sim \mathcal{P}(D_{fk} A_{kr})        (D.3)

where the magnitude of each component is assumed to be Poisson distributed with
\mathcal{P}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}.
As shown for example in [GW], the sum of statistically independent Poisson random
variables is also Poisson distributed and its rate parameter is the sum of the individual
rate parameters, such that

|Y_{fr}| \sim \mathcal{P}\left( \sum_{k=1}^{K} D_{fk} A_{kr} \right)        (D.4)
Again assuming the components to be mutually independent and individually independent
across frequencies and frames, the negative log-likelihood is given by

C_{ML}(D, A) = -\sum_{fr} \log p(|Y_{fr}| \,|\, D_f, A_r)
             = \sum_{fr} D_f^T A_r - |Y_{fr}| \log(D_f^T A_r) + \log(|Y_{fr}|!)
             = \sum_{fr} D_f^T A_r + |Y_{fr}| \log\frac{|Y_{fr}|}{D_f^T A_r} - |Y_{fr}| \log|Y_{fr}| + \log(|Y_{fr}|!)
             \overset{c}{=} \sum_{fr} D_f^T A_r + |Y_{fr}| \log\frac{|Y_{fr}|}{D_f^T A_r} - |Y_{fr}| = C_{KL}(|Y|, DA)        (D.5)
This shows that ML estimation of the sum of Poisson distributed magnitude components
is equivalent, up to a constant, to matrix factorization of the magnitude spectrogram
with the Kullback-Leibler divergence.
Note that the factorial operator is only defined for non-negative integer values. [FBD09]
gives an interpretation for sampled audio spectrograms by scaling them to integer values.
For solving the optimization problem, the factorial operator vanishes anyway, as
the term |Y_{fr}|! is independent of the parameters to be optimized.
One has to be aware that the model described above no longer assumes Gaussian
sources. In fact, the sources themselves are not modeled directly. Their magnitude is modeled,
but no assumption is made on their phase. As the amplitudes of the components
in equation D.3 add up, this can be interpreted as if all components had the same
phase. Moreover, if only the components' amplitudes are estimated, the phase remains
undetermined. One obvious estimate is used for example in [Vir07] and [Sma07], where
simply the phase of the mixture is used for all components, \arg(c_{k,fr}) = \arg(Y_{fr}) \; \forall k \in
[1; K]. [FBD09] points out that this approach is not conservative: the sum
of the estimated sources does not necessarily result in the observed signal.
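The phase assignment just described can be written compactly. The following NumPy sketch reconstructs complex component spectrograms from estimated magnitudes D_{fk}A_{kr} by borrowing the mixture phase; the variable names are illustrative and this is not code from [Vir07] or [Sma07].

    import numpy as np

    def components_with_mixture_phase(Y, D, A):
        # Assign the mixture phase arg(Y_fr) to every estimated magnitude component.
        # Y: complex mixture spectrogram (F x R), D: dictionary (F x K), A: activations (K x R).
        phase = np.exp(1j * np.angle(Y))
        return np.einsum('fk,kr->kfr', D, A) * phase[None, :, :]

    # toy usage; note that summing the components gives (D @ A) * exp(j*arg(Y)),
    # which in general differs from Y itself -- the estimate is not conservative.
    Y = np.random.randn(64, 100) + 1j * np.random.randn(64, 100)
    D = np.abs(np.random.randn(64, 3)); A = np.abs(np.random.randn(3, 100))
    C = components_with_mixture_phase(Y, D, A)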
D.1.3 Itakura-Saito divergence as ML estimation of gamma multiplicative noise
[FBD09] showed the connection between ML estimation in gamma multiplicative noise
and matrix factorization with the Itakura-Saito distance.
Consider a generative model where the observations Y are a rank-K matrix distorted by
multiplicative noise

Y = (DA) \,.\, E        (D.6)

and where the multiplicative noise E is independent and identically distributed (i.i.d.)
according to

e_{fr} = \frac{Y_{fr}}{D_f^T A_r} \sim \mathcal{G}(\alpha, \beta)        (D.7)

with \mathcal{G}(x|\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, x \geq 0, denoting the gamma distribution and \Gamma the
gamma function. Again using independence across frequency and time, the negative
log-likelihood for the parameters that form the estimate of the rank-K matrix is given
by

C_{ML}(D, A) = -\sum_{fr} \log p(Y_{fr} \,|\, D_f, A_r)
             = -\sum_{fr} \log \mathcal{G}\Big( \frac{Y_{fr}}{D_f^T A_r} \,\Big|\, \alpha, \beta \Big)
             = \sum_{fr} -\alpha \log(\beta) + \log(\Gamma(\alpha)) - (\alpha - 1) \log\Big( \frac{Y_{fr}}{D_f^T A_r} \Big) + \beta \frac{Y_{fr}}{D_f^T A_r}
             \overset{c}{=} \beta \sum_{fr} \frac{Y_{fr}}{D_f^T A_r} - \frac{\alpha - 1}{\beta} \log\Big( \frac{Y_{fr}}{D_f^T A_r} \Big) - 1        (D.8)

If \frac{\alpha - 1}{\beta} = 1, the maximum likelihood criterion is a scaled version of NMF with the Itakura-Saito
divergence:^1

C_{ML}(D, A) = \beta \, C_{IS}(Y, D^T A)        (D.9)

^1 Note that the author writes the equivalence is given for \frac{\alpha}{\beta} = 1.
Note that in this model the observations Y are not specified further (in contrast to the previous
chapters). The model could therefore be applied to the magnitude, the squared magnitude, or any
other form of the observations. As the relation of Y to the actual observed signal is
not specified, there is also no method defined to calculate estimates of
the individual sources.
Nevertheless, this interpretation of matrix factorization with the Itakura-Saito distance
is interesting, as the Itakura-Saito divergence is scale invariant.
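The scale invariance can be checked numerically with a few lines (a small sketch using the usual definition d_IS(x||y) = x/y - log(x/y) - 1):

    import numpy as np

    def d_is(x, y):
        # element-wise Itakura-Saito divergence
        return x / y - np.log(x / y) - 1.0

    x = np.abs(np.random.randn(1000)) + 1e-3
    y = np.abs(np.random.randn(1000)) + 1e-3
    lam = 7.3   # arbitrary positive scale
    print(np.allclose(d_is(lam * x, lam * y), d_is(x, y)))   # True: d_IS is scale invariant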
D.2 Deriving the α-divergence as a Csiszar divergence
Recall the Csiszar φ-divergence of equation 3.4:

D_C(X || \hat{X}) = \sum_{r,f} d_{\phi}(X_{r,f}, \hat{X}_{r,f}) \quad \text{with} \quad d_{\phi}(x||y) = x \, \phi\Big( \frac{y}{x} \Big)
For \phi(z) = \frac{z(z^{\nu-1} - 1)}{\nu(\nu - 1)} + \frac{1 - z}{\nu} = \frac{z^{\nu} - \nu z + \nu - 1}{\nu(\nu - 1)} the resulting divergence is

d_{\nu}(x||y) = x \, \frac{\left( \frac{y}{x} \right)^{\nu} - \nu \frac{y}{x} + \nu - 1}{\nu(\nu - 1)} = \frac{y^{\nu} x^{1-\nu} - \nu y + (\nu - 1) x}{\nu(\nu - 1)}        (D.10)
In their later works ([CZPA09] and [CCA11]) the authors use a slightly different notation,
which can be derived by substituting \nu = 1 - \alpha:

d_{\alpha}(x||y) = \frac{y^{1-\alpha} x^{\alpha} - (1 - \alpha) y - \alpha x}{(1 - \alpha)(-\alpha)} = \frac{y^{1-\alpha} x^{\alpha} + (\alpha - 1) y - \alpha x}{(\alpha - 1)\alpha}        (D.11)

Note that equations D.10 and D.11 are both denoted generalized α-divergence; we
will use the second definition in this thesis to ease comparison with the generalized
α-β-divergence in chapter 3.1.3.
The limits for α → 0 and α → 1 are usually reformulated. For α → 0:

\lim_{\alpha \to 0} d_{\alpha}(x||y) = \lim_{\alpha \to 0} \frac{y^{1-\alpha} x^{\alpha} - y}{-\alpha} + \lim_{\alpha \to 0} \frac{\alpha y - \alpha x}{-\alpha}
 = x - y + y \lim_{\alpha \to 0} \frac{y^{-\alpha} x^{\alpha} - 1}{-\alpha} = x - y + y \lim_{\alpha \to 0} \frac{e^{\alpha(\ln(x) - \ln(y))} - 1}{-\alpha}        (D.12)

Using l'Hospital's rule the limit rewrites as

\lim_{\alpha \to 0} d_{\alpha}(x||y) = x - y + y \lim_{\alpha \to 0} \frac{(\ln(x) - \ln(y)) \, e^{\alpha(\ln(x) - \ln(y))}}{-1}
 = x - y + y(-\ln(x) + \ln(y)) = y \ln\Big( \frac{y}{x} \Big) + x - y        (D.13)
Similarly, for α → 1:

\lim_{\alpha \to 1} d_{\alpha}(x||y) = \lim_{\alpha \to 1} \frac{(\alpha - 1)(y - x)}{\alpha - 1} + x \lim_{\alpha \to 1} \frac{y^{1-\alpha} x^{\alpha-1} - 1}{\alpha - 1}
 = y - x + x \lim_{\alpha \to 1} \frac{\left( \frac{x}{y} \right)^{\alpha-1} - 1}{\alpha - 1}        (D.14)

Again using l'Hospital's rule the term simplifies to

\lim_{\alpha \to 1} d_{\alpha}(x||y) = y - x + x \lim_{\alpha \to 1} \frac{\ln\big( \frac{x}{y} \big) \, e^{(\alpha-1)\ln\left(\frac{x}{y}\right)}}{1} = y - x + x \ln\Big( \frac{x}{y} \Big)        (D.15)
Using equations D.13 and D.15 the α-divergence can be written, as defined in equation 3.5, as

d_{\alpha}(x||y) =
\begin{cases}
\frac{1}{\alpha(\alpha-1)} \left( x^{\alpha} y^{1-\alpha} - \alpha x + (\alpha - 1) y \right) & \alpha \in \mathbb{R} \setminus \{0, 1\} \\
x \ln\frac{x}{y} + y - x & \alpha = 1 \\
y \ln\frac{y}{x} + x - y & \alpha = 0
\end{cases}
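As a sanity check on these limits, the piecewise definition can be implemented and compared against the generic formula for α close to 0 and 1; this is a small numerical sketch, not part of the thesis framework.

    import numpy as np

    def d_alpha(x, y, alpha):
        # generalized alpha-divergence with its limit cases at alpha = 0 and alpha = 1
        if alpha == 0:
            return y * np.log(y / x) + x - y
        if alpha == 1:
            return x * np.log(x / y) + y - x
        return (x**alpha * y**(1 - alpha) - alpha * x + (alpha - 1) * y) / (alpha * (alpha - 1))

    x, y = 2.0, 3.0
    print(d_alpha(x, y, 1e-6), d_alpha(x, y, 0))      # generic formula approaches the alpha -> 0 limit
    print(d_alpha(x, y, 1 - 1e-6), d_alpha(x, y, 1))  # ... and the alpha -> 1 limit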
D.3 Deriving the β-divergence as a Bregman divergence
[HDB11] shows that the β-divergence can be derived from the Bregman divergences by
using \varphi''_{\beta}(x) = x^{\beta-1}, \varphi'_{\beta}(x) = \frac{x^{\beta}}{\beta} and \varphi_{\beta}(x) = \frac{x^{\beta+1}}{(\beta+1)\beta}.

d_{\varphi_{\beta}}(x||y) = \varphi_{\beta}(x) - \varphi_{\beta}(y) - \varphi'_{\beta}(y)(x - y)
 = \frac{x^{\beta+1}}{(\beta+1)\beta} - \frac{y^{\beta+1}}{(\beta+1)\beta} - \frac{y^{\beta}}{\beta}(x - y) = \frac{x^{\beta+1} + \beta y^{\beta+1} - (\beta+1) x y^{\beta}}{\beta(\beta+1)}        (D.16)
Again two limits have to be determined. For β → −1 the limit of equation D.16 is

\lim_{\beta \to -1} d_{\varphi_{\beta}}(x||y) = \lim_{\beta \to -1} \frac{(\beta+1)(-x y^{\beta} + y^{\beta+1})}{-(\beta+1)} + \lim_{\beta \to -1} \frac{x^{\beta+1} - y^{\beta+1}}{-(\beta+1)}
 = \frac{x}{y} - 1 - \lim_{\beta \to -1} \frac{e^{(\beta+1)\ln(x)} - e^{(\beta+1)\ln(y)}}{\beta + 1}        (D.17)

Using l'Hospital's rule the limit rewrites as

\lim_{\beta \to -1} d_{\varphi_{\beta}}(x||y) = \frac{x}{y} - 1 - \lim_{\beta \to -1} \frac{\ln(x) e^{(\beta+1)\ln(x)} - \ln(y) e^{(\beta+1)\ln(y)}}{1}
 = \frac{x}{y} - 1 - \ln(x) + \ln(y) = -\ln\Big( \frac{x}{y} \Big) + \frac{x}{y} - 1        (D.18)
For β → 0 the limit of equation D.16 is

\lim_{\beta \to 0} d_{\varphi_{\beta}}(x||y) = \lim_{\beta \to 0} \frac{\beta(y^{\beta+1} - x y^{\beta})}{\beta} + \lim_{\beta \to 0} \frac{x^{\beta+1} - x y^{\beta}}{\beta}
 = y - x + x \lim_{\beta \to 0} \frac{e^{\beta \ln(x)} - e^{\beta \ln(y)}}{\beta}        (D.19)

Again using l'Hospital's rule the term rewrites as

\lim_{\beta \to 0} d_{\varphi_{\beta}}(x||y) = y - x + x \lim_{\beta \to 0} \frac{\ln(x) e^{\beta \ln(x)} - \ln(y) e^{\beta \ln(y)}}{1} = y - x + x \ln\Big( \frac{x}{y} \Big)        (D.20)
Finally the β-divergence can be written, as already defined in equation 3.7, as

d_{\beta}(x||y) =
\begin{cases}
\frac{1}{\beta(\beta+1)} \left( x^{\beta+1} + \beta y^{\beta+1} - (\beta+1) x y^{\beta} \right) & \beta \in \mathbb{R} \setminus \{-1, 0\} \\
x \ln\frac{x}{y} + y - x & \beta = 0 \\
\frac{x}{y} - \ln\frac{x}{y} - 1 & \beta = -1
\end{cases}
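The same kind of numerical check applies here; the sketch below implements the piecewise β-divergence and compares the generic expression against its KL (β = 0) and Itakura-Saito (β = −1) limits.

    import numpy as np

    def d_beta(x, y, beta):
        # beta-divergence with its limit cases at beta = 0 (KL) and beta = -1 (IS)
        if beta == 0:
            return x * np.log(x / y) + y - x
        if beta == -1:
            return x / y - np.log(x / y) - 1.0
        return (x**(beta + 1) + beta * y**(beta + 1) - (beta + 1) * x * y**beta) / (beta * (beta + 1))

    x, y = 2.0, 3.0
    print(d_beta(x, y, 1e-6), d_beta(x, y, 0))        # approaches the KL limit
    print(d_beta(x, y, -1 + 1e-6), d_beta(x, y, -1))  # approaches the IS limit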
D.4 Derivation of MU rules for the α-β divergence (single channel)
Calculating the gradient of the generalized α-β-divergence with respect to some parameter θ can
be formulated as follows:

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = \sum_{fr} \frac{\partial}{\partial \theta} d_{\alpha,\beta}(X_{fr} || \hat{X}_{fr}) = \sum_{fr} \frac{\partial d_{\alpha,\beta}(X_{fr} || \hat{X}_{fr})}{\partial \hat{X}_{fr}} \, \frac{\partial \hat{X}_{fr}}{\partial \theta}
 = -\sum_{fr} \hat{X}_{fr}^{\alpha+\beta-1} \, \frac{\left( \frac{X_{fr}}{\hat{X}_{fr}} \right)^{\alpha} - 1}{\alpha} \, \frac{\partial \hat{X}_{fr}}{\partial \theta}        (D.21)
[CCA11] now introduces the (1−α) deformed logarithm

\ln_{1-\alpha}(z) = \begin{cases} \frac{z^{\alpha} - 1}{\alpha} & \text{for } \alpha \neq 0 \\ \ln(z) & \text{for } \alpha = 0 \end{cases}        (D.22)

such that

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = -\sum_{fr} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \theta}        (D.23)
For matrix factorization the parameters θ = \{A, D\} form the estimate \hat{X} = DA such
that

\hat{X}_{fr} = D_f^T A_r = \sum_k D_{fk} A_{kr}        (D.24)
The gradients for the parameters are therefore given by

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial A_{kr}} = -\sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) D_{fk}        (D.25)

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial D_{fk}} = -\sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) A_{kr}        (D.26)
Instead of applying a gradient descent algorithm directly to the parameters, it is
applied in the natural parameter space \phi(z), such that

A_{kr} \leftarrow \phi^{-1}\Big( \phi(A_{kr}) - \eta_{kr} \frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \phi(A_{kr})} \Big)        (D.27)

D_{fk} \leftarrow \phi^{-1}\Big( \phi(D_{fk}) - \eta_{fk} \frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \phi(D_{fk})} \Big)        (D.28)
If the (1−α) deformed logarithm is used as the transformation, \phi(z) = \ln_{1-\alpha}(z), the
inverse transform is the (1−α) deformed exponential:

\phi^{-1}(z) = \exp_{1-\alpha}(z) = \begin{cases} e^{z} & \text{for } \alpha = 0 \\ (1 + \alpha z)^{\frac{1}{\alpha}} & \text{for } 1 + \alpha z \geq 0 \\ 0 & \text{for } 1 + \alpha z < 0 \end{cases}        (D.29)
A_{kr} \leftarrow \exp_{1-\alpha}\Big( \ln_{1-\alpha}(A_{kr}) + \eta_{kr} \sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} \Big)        (D.30)

D_{fk} \leftarrow \exp_{1-\alpha}\Big( \ln_{1-\alpha}(D_{fk}) + \eta_{fk} \sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} \Big)        (D.31)
by using

\exp_{1-\alpha}(\ln_{1-\alpha}(z) + v) = \Big( 1 + \alpha \Big( \frac{z^{\alpha} - 1}{\alpha} + v \Big) \Big)^{\frac{1}{\alpha}} = (1 + z^{\alpha} - 1 + \alpha v)^{\frac{1}{\alpha}} = z \Big( 1 + \frac{\alpha v}{z^{\alpha}} \Big)^{\frac{1}{\alpha}} = z \exp_{1-\alpha}\Big( \frac{v}{z^{\alpha}} \Big)        (D.32)
equation D.30 can be reformulated to

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \frac{1}{A_{kr}^{\alpha}} \eta_{kr} \sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} \Big)        (D.33)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \frac{1}{D_{fk}^{\alpha}} \eta_{fk} \sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} \Big)        (D.34)
Inserting

\frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} = \frac{\partial}{\partial \ln_{1-\alpha}(A_{kr})} \exp_{1-\alpha}(\ln_{1-\alpha}(A_{kr})) \, D_{fk} = A_{kr}^{1-\alpha} D_{fk}        (D.35)

\frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} = A_{kr} D_{fk}^{1-\alpha}        (D.36)
this simplifies to

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \eta_{kr} \sum_{f} A_{kr}^{1-2\alpha} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.37)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \eta_{fk} \sum_{r} A_{kr} D_{fk}^{1-2\alpha} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.38)
By selecting suitable step sizes

\eta_{kr} = \frac{A_{kr}^{2\alpha-1}}{\sum_{f} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}        (D.39)

\eta_{fk} = \frac{D_{fk}^{2\alpha-1}}{\sum_{r} A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}}        (D.40)
the MU rules rewrite as

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \sum_{f} \frac{D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{f'} D_{f'k} \hat{X}_{f'r}^{\alpha+\beta-1}} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.41)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \sum_{r} \frac{A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{r'} A_{kr'} \hat{X}_{fr'}^{\alpha+\beta-1}} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.42)
For α ≠ 0 the generalized MU rules simplify to

A_{kr} \leftarrow A_{kr} \Big( 1 + \alpha \sum_{f} \frac{D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{f'} D_{f'k} \hat{X}_{f'r}^{\alpha+\beta-1}} \, \frac{\left( \frac{X_{fr}}{\hat{X}_{fr}} \right)^{\alpha} - 1}{\alpha} \Big)^{\frac{1}{\alpha}} = A_{kr} \Big( \frac{\sum_{f} D_{fk} \hat{X}_{fr}^{\beta-1} X_{fr}^{\alpha}}{\sum_{f} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}} \Big)^{\frac{1}{\alpha}}        (D.43)

D_{fk} \leftarrow D_{fk} \Big( \frac{\sum_{r} A_{kr} \hat{X}_{fr}^{\beta-1} X_{fr}^{\alpha}}{\sum_{r} A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}} \Big)^{\frac{1}{\alpha}}        (D.44)
Rewriting this with matrix multiplications leads to

A_{kr} \leftarrow A_{kr} \Big( \frac{D_k^T (\hat{X}_r^{.[\beta-1]} . X_r^{.[\alpha]})}{D_k^T \hat{X}_r^{.[\alpha+\beta-1]}} \Big)^{\frac{1}{\alpha}}        (D.45)

D_{fk} \leftarrow D_{fk} \Big( \frac{(\hat{X}_f^{.[\beta-1]} . X_f^{.[\alpha]})^T A_k}{(\hat{X}_f^{.[\alpha+\beta-1]})^T A_k} \Big)^{\frac{1}{\alpha}}        (D.46)

and finally to the form of equation 3.18

A \leftarrow A \,.\, \Big( \frac{D^T (\hat{X}^{.[\beta-1]} . X^{.[\alpha]})}{D^T \hat{X}^{.[\alpha+\beta-1]}} \Big)^{.[\frac{1}{\alpha}]}

D \leftarrow D \,.\, \Big( \frac{(\hat{X}^{.[\beta-1]} . X^{.[\alpha]}) A^T}{\hat{X}^{.[\alpha+\beta-1]} A^T} \Big)^{.[\frac{1}{\alpha}]}
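For illustration, the matrix-form update rules above translate directly into NumPy; the following is a minimal sketch for α ≠ 0 (with a small constant to avoid divisions by zero), not the implementation used in the experiments of this thesis.

    import numpy as np

    def nmf_alpha_beta(X, K, alpha=0.5, beta=0.5, n_iter=200, eps=1e-12):
        # Multiplicative updates for the generalized alpha-beta divergence (alpha != 0),
        # following the matrix form of the update rules derived above.
        F, R = X.shape
        rng = np.random.default_rng(0)
        D = rng.random((F, K)) + eps
        A = rng.random((K, R)) + eps
        for _ in range(n_iter):
            Xh = D @ A + eps
            A *= ((D.T @ (Xh**(beta - 1) * X**alpha)) / (D.T @ Xh**(alpha + beta - 1) + eps))**(1.0 / alpha)
            Xh = D @ A + eps
            D *= (((Xh**(beta - 1) * X**alpha) @ A.T) / (Xh**(alpha + beta - 1) @ A.T + eps))**(1.0 / alpha)
        return D, A

    # toy usage; alpha = 1, beta = -1 reproduces the familiar Itakura-Saito updates
    X = np.abs(np.random.randn(64, 100))**2 + 1e-6
    D, A = nmf_alpha_beta(X, K=5, alpha=1.0, beta=-1.0)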
D.5 Derivation of a multichannel EM algorithm for the Itakura-Saito divergence
C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + (Y_{fr} - G_{C,f} C_{fr})^H \Sigma_{B,f}^{-1} (Y_{fr} - G_{C,f} C_{fr})
 + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]

By using G_{C,f} C_{fr} = G_{S,f} S_{fr} and adding the trace to the scalar expression, the equation
above can be reformulated as

C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]
 + R \sum_{f} \mathrm{tr}\Big( \frac{1}{R} \sum_{r} \Sigma_{B,f}^{-1} Y_{fr} Y_{fr}^H - \Sigma_{B,f}^{-1} G_{S,f} S_{fr} Y_{fr}^H
 - \Sigma_{B,f}^{-1} Y_{fr} S_{fr}^H G_{S,f}^H + \Sigma_{B,f}^{-1} G_{S,f} S_{fr} S_{fr}^H G_{S,f}^H \Big)        (D.47)
By defining the spatial covariance matrices

R_{YY,f} = \frac{1}{R} \sum_{r} Y_{fr} Y_{fr}^H        (D.48)

R_{YS,f} = \frac{1}{R} \sum_{r} Y_{fr} S_{fr}^H        (D.49)

R_{SS,f} = \frac{1}{R} \sum_{r} S_{fr} S_{fr}^H        (D.50)
equation D.47 rewrites as

C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]        (D.51)
 + R \sum_{f} \mathrm{tr}\Big( \Sigma_{B,f}^{-1} \big( R_{YY,f} - G_{S,f} R_{YS,f}^H - R_{YS,f} G_{S,f}^H + G_{S,f} R_{SS,f} G_{S,f}^H \big) \Big)        (D.52)
By setting the gradient to zero, D_{fk} can be calculated as

\frac{\partial C_{ML}(G, D, A)}{\partial D_{fk}} = \sum_{r} \frac{1}{D_{fk}} - \frac{|C_{k,fr}|^2}{(D_{fk} A_{kr})^2} A_{kr} \overset{!}{=} 0        (D.53)
From equation D.53, and in a similar manner for the other parameters, the parameter
estimates result, as presented in chapter 3.3.2 in equation 3.27, as

D_{fk} = \frac{1}{R} \sum_{r} \frac{|C_{k,fr}|^2}{A_{kr}} \overset{\Delta}{=} \frac{1}{R} \sum_{r} \frac{U_{k,fr}}{A_{kr}}

A_{kr} = \frac{1}{F} \sum_{f} \frac{U_{k,fr}}{D_{fk}}

G_{S,f} = R_{YS,f} R_{SS,f}^{-1}
and, by considering that \Sigma_{B,f} has to be diagonal and using \frac{\partial \, \mathrm{tr}(V^{-1} U)}{\partial V} = -(V^{-1} U V^{-1})^T and
\frac{\partial \log|\det(V)|}{\partial V} = (V^{-1})^T,

\Sigma_{B,f} = \mathrm{diag}\big( R_{YY,f} - G_{S,f} R_{YS,f}^H - R_{YS,f} G_{S,f}^H + G_{S,f} R_{SS,f} G_{S,f}^H \big)
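A compact NumPy sketch of the spatial part of this M-step for a single frequency bin f is given below; it assumes point estimates of the multichannel source images S_f from the E-step (in the actual EM algorithm the moments are posterior expectations), and the NMF parameters D and A are then updated from U_{k,fr} = |C_{k,fr}|^2 as in equation 3.27. Variable names are illustrative.

    import numpy as np

    def spatial_m_step_bin(Y_f, S_f):
        # Y_f: observed STFT frames of bin f, shape (M, R); S_f: source estimates, shape (J, R).
        R = Y_f.shape[1]
        R_YY = (Y_f @ Y_f.conj().T) / R        # (D.48)
        R_YS = (Y_f @ S_f.conj().T) / R        # (D.49)
        R_SS = (S_f @ S_f.conj().T) / R        # (D.50)

        G = R_YS @ np.linalg.inv(R_SS)         # G_S,f = R_YS,f R_SS,f^{-1}
        residual = R_YY - G @ R_YS.conj().T - R_YS @ G.conj().T + G @ R_SS @ G.conj().T
        Sigma_B = np.diag(np.diag(residual).real)   # Sigma_B,f constrained to be diagonal
        return G, Sigma_B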
D.6 Computational growth for parallel matrix inversion

In the EM algorithm proposed by [OF10], a huge number of M × M matrices have to be
inverted. This chapter shows that the computational complexity grows severely from
2 to 4 microphones.
The inversion of a 2 × 2 matrix A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} is simply given, using only
multiplications and additions, as

\det(A) = a_{11} a_{22} - a_{12} a_{21}        (D.54)

A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}        (D.55)

Thus it can be calculated via matrix operations for all time-frequency bins at once. For each
time-frequency bin, the determinant has to be calculated with 2 multiplications and 1
subtraction, and another 4 multiplications are needed for the scaling of the entries. In total this gives
6 multiplications and 1 subtraction.
A similar formulation can be made for 4 × 4 matrices². The calculation of the determinant
then requires 3 · 24 multiplications/divisions and 10 additions/subtractions. It
furthermore requires 16 multiplications for scaling and, for each of the 16 components,
12 multiplications and 5 additions. So in total it requires 280 multiplications and 103
additions/subtractions.

²for example with the formula at http://www.cg.info.hiroshima-cu.ac.jp/~miyazaki/knowledge/teche23.html
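The parallel 2 × 2 inversion can therefore be carried out with plain array arithmetic over all time-frequency bins at once; the NumPy sketch below is illustrative and not the thesis implementation.

    import numpy as np

    def invert_2x2_batch(A):
        # Closed-form inverse of a batch of 2x2 matrices, shape (..., 2, 2),
        # using equations D.54/D.55 for every time-frequency bin at once.
        a11, a12 = A[..., 0, 0], A[..., 0, 1]
        a21, a22 = A[..., 1, 0], A[..., 1, 1]
        idet = 1.0 / (a11 * a22 - a12 * a21)     # determinant: 2 multiplications, 1 subtraction per bin
        inv = np.empty_like(A)
        inv[..., 0, 0], inv[..., 0, 1] = a22 * idet, -a12 * idet   # 4 multiplications
        inv[..., 1, 0], inv[..., 1, 1] = -a21 * idet, a11 * idet   # for the scaling
        return inv

    # toy usage: F x R Hermitian positive definite 2x2 matrices
    B = np.random.randn(513, 100, 2, 2) + 1j * np.random.randn(513, 100, 2, 2)
    A = B @ np.swapaxes(B.conj(), -1, -2) + np.eye(2)
    print(np.allclose(invert_2x2_batch(A) @ A, np.eye(2)))   # True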
Bibliography
[Att03] Attias, Hagai: New EM Algorithms for Source Separation and Deconvo-
lution with a Microphone Array, 2003
[BDBG03] Benaroya, Laurent ; Donagh, Lorcan M. ; Bimbot, Frederic ; Gribonval,
Remi: Non negative sparse representation for Wiener based source
separation with a single sensor. In: Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), 2003, S. 613–616
[BW01] Brandstein, Michael ; Ward, Darren: Microphone arrays: signal pro-
cessing techniques and applications. Springer Science & Business Media,
2001
[Car98a] Cardoso, Jean-Francois: Blind signal separation: statistical principles.
In: Proceedings of the IEEE 86 (1998), Nr. 10, S. 2009–2025
[Car98b] Cardoso, Jean-Francois: Multidimensional independent component
analysis. In: Acoustics, Speech and Signal Processing, 1998. Proceedings
of the 1998 IEEE International Conference on Bd. 4 IEEE, 1998, S. 1941–
1944
[Car01] Cardoso, Jean-Francois: The three easy routes to independent compo-
nent analysis; contrasts and geometry. In: Proc. ICA Bd. 2001, 2001
[CAZ+06] Cichocki, Andrzej ; Amari, Shun-ichi ; Zdunek, Rafal ; Kompass,
Raul ; Hori, Gen ; He, Zhaohui: Extended SMART algorithms for
non-negative matrix factorization. In: Artificial Intelligence and Soft
Computing–ICAISC 2006. Springer, 2006, S. 548–562
[CCA11] Cichocki, Andrzej ; Cruces, Sergio ; Amari, Shun-ichi: Generalized
alpha-beta divergences and their application to robust nonnegative matrix
factorization. In: Entropy 13 (2011), Nr. 1, S. 134–170
[CHH07] Cai, Deng ; He, Xiaofei ; Han, Jiawei: Semi-supervised discriminant
analysis. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International
Conference on IEEE, 2007, S. 1–7
[Com94] Comon, Pierre: Independent component analysis, a new concept? In:
Signal processing 36 (1994), Nr. 3, S. 287–314
[CZA06] Cichocki, Andrzej ; Zdunek, Rafal ; Amari, Shun-ichi: Csiszar's divergences
for non-negative matrix factorization: Family of new algorithms.
In: LNCS, Springer, 2006, S. 32–39
[CZO87] Cox, Henry ; Zeskind, Robert M. ; Owen, Mark M.: Robust adaptive
beamforming. In: Acoustics, Speech and Signal Processing, IEEE Trans-
actions on 35 (1987), Nr. 10, S. 1365–1376
[CZPA09] Cichocki, Andrzej ; Zdunek, Rafal ; Phan, Anh H. ; Amari, Shun-ichi:
Nonnegative matrix and tensor factorizations: applications to exploratory
multi-way data analysis and blind source separation. John Wiley & Sons,
2009
[D+09] Dikmen, Onur u. a.: Unsupervised single-channel source separation us-
ing Bayesian NMF. In: Applications of Signal Processing to Audio and
Acoustics, 2009. WASPAA’09. IEEE Workshop on IEEE, 2009, S. 93–96
[DK14] Deleforge, Antoine ; Kellermann, Walter: Phase-Optimized
K-SVD for Signal Extraction from Underdetermined Multi-
channel Sparse Mixtures. In: CoRR abs/1410.2430 (2014).
http://arxiv.org/abs/1410.2430
[DS05] Dhillon, Inderjit S. ; Sra, Suvrit: Generalized nonnegative matrix ap-
proximations with Bregman divergences. In: In: Neural Information Proc.
Systems, 2005, S. 283–290
[DVG10] Duong, Ngoc Q. ; Vincent, Emmanuel ; Gribonval, Remi: Under-
determined reverberant audio source separation using a full-rank spatial
covariance model. In: Audio, Speech, and Language Processing, IEEE
Transactions on 18 (2010), Nr. 7, S. 1830–1840
[EF13] Essid, Slim ; Fevotte, Cedric: Smooth nonnegative matrix factorization
for unsupervised audiovisual document structuring. In: Multimedia, IEEE
Transactions on 15 (2013), Nr. 2, S. 415–425
[EK01] Eguchi, Shinto ; Kano, Yutaka: Robustifying maximum likelihood es-
timation. In: Tokyo Institute of Statistical Mathematics, Tokyo, Japan,
Tech. Rep (2001)
[FBD09] Fevotte, Cedric ; Bertin, Nancy ; Durrieu, Jean-Louis: Nonnegative
Matrix Factorization with the Itakura-saito Divergence: With Application
to Music Analysis. Cambridge, MA, USA : MIT Press, Marz 2009. – ISSN
0899–7667, 793–830
[FC09] Fevotte, Cedric ; Cemgil, A T.: Nonnegative matrix factorizations as
probabilistic inference in composite models. In: Signal Processing Confer-
ence, 2009 17th European IEEE, 2009, S. 1913–1917
[GW] Grimmett, Geoffrey ; Welsh, Dominic: Probability: An Introduction
[HDB11] Hennequin, Romain ; David, Bertrand ; Badeau, Roland: Beta-
divergence as a subclass of bregman divergence. In: IEEE Signal Pro-
cessing Letters 18 (2011), Nr. 2, S. 83–86
[HO00] Hyvarinen, Aapo ; Oja, Erkki: Independent component analysis: algo-
rithms and applications, Elsevier, 2000, S. 411–430
[Inc11] Ince, Gokhan: Ego Noise Estimation for Robot Audition, 2011
[KSRS11] Kumar, Kshitiz ; Singh, Rita ; Raj, Bhiksha ; Stern, Richard: Gam-
matone sub-band magnitude-domain dereverberation for ASR. In: Acous-
tics, Speech and Signal Processing (ICASSP), 2011 IEEE International
Conference on IEEE, 2011, S. 4604–4607
[LA04] Li, Chunjian ; Andersen, Søren V.: Inter-frequency dependency in
MMSE speech enhancement. In: Proceedings of the 6th Nordic Signal
Processing Symposium Citeseer, 2004
[LBD+14] Loellmann, Heinrich W. ; Barfuss, Hendrik ; Deleforge, Antoine ;
Meier, Stefan ; Kellermann, Walter: Challenges in acoustic signal en-
hancement for human-robot communication. In: Speech Communication;
11. ITG Symposium; Proceedings of VDE, 2014, S. 1–4
[LDDR13] Liutkus, Antoine ; Durrieu, Jean-Louis ; Daudet, Laurent ; Richard,
Guilhem: An overview of informed audio source separation. In: Image
Analysis for Multimedia Interactive Services (WIAMIS), 2013 14th Inter-
national Workshop on IEEE, 2013, S. 1–4
[LGS+12] Liutkus, Antoine ; Gorlow, Stanislaw ; Sturmel, Nicolas ; Zhang,
Shuhua ; Girin, Laurent ; Badeau, Roland ; Daudet, Laurent ; Marc-
hand, Sylvain ; Richard, Gael: Informed audio source separation: A
comparative study. In: Signal Processing Conference (EUSIPCO), 2012
Proceedings of the 20th European IEEE, 2012, S. 2397–2401
[LHZC01] Li, Stan Z. ; Hou, Xin W. ; Zhang, HongJiang ; Cheng, QianSheng:
Learning spatially localized, parts-based representation. In: Computer
Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the
2001 IEEE Computer Society Conference on Bd. 1 IEEE, 2001, S. I–207
[Li06] Li, Chunjian: Non-Gaussian, Non-stationary and Nonlinear Signal Pro-
cessing Methods-with Applications to Speech Processing and Channel Es-
timation, Videnbasen for Aalborg UniversitetVBN, Aalborg Universite-
tAalborg University, Det Teknisk-Naturvidenskabelige FakultetThe Fac-
ulty of Engineering and Science, Multimedia Information and Signal Pro-
cessingMultimedia Information and Signal Processing, 2006
[LLP12] Li, Liangda ; Lebanon, Guy ; Park, Haesun: Fast bregman divergence
nmf using taylor expansion and coordinate descent. In: Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and
data mining ACM, 2012, S. 307–315
[LN11] Low, Siow Y. ; Nordholm, Sven: An insight into the parametric mul-
tichannel wiener formulation. In: Computer Science and Automation En-
gineering (CSAE), 2011 IEEE International Conference on Bd. 1 IEEE,
2011, S. 513–516
[LN13] Li, Yifeng ; Ngom, Alioune: Versatile sparse matrix factorization and
its applications in high-dimensional biological data analysis. In: Pattern
Recognition in Bioinformatics. Springer, 2013, S. 91–101
[LS00] Lee, Daniel D. ; Seung, H. S.: Algorithms for Non-negative Matrix
Factorization. In: In NIPS, MIT Press, 2000, S. 556–562
[MH14] Mirsamadi, Seyedmahdad ; Hansen, John H.: Multichannel speech
dereverberation based on convolutive nonnegative tensor factorization for
ASR applications. In: Fifteenth Annual Conference of the International
Speech Communication Association, 2014
[OF10] Ozerov, Alexey ; Fevotte, Cedric: Multichannel nonnegative matrix
factorization in convolutive mixtures for audio source separation. In: Au-
dio, Speech, and Language Processing, IEEE Transactions on 18 (2010),
Nr. 3, S. 550–563
[OV10] Oppenheim, Alan ; Verghese, George: 6.011 Introduction to Commu-
nication Control and Signal Processing, Massachusetts Institute of Tech-
nology: MIT OpenCourseWare, Spring 2010
[OVB12] Ozerov, Alexey ; Vincent, Emmanuel ; Bimbot, Frederic: A general
flexible framework for the handling of prior information in audio source
separation. In: Audio, Speech, and Language Processing, IEEE Transac-
tions on 20 (2012), Nr. 4, S. 1118–1133
[PC01] Pham, Dinh-Tuan ; Cardoso, Jean-Francois: Blind separation of instan-
taneous mixtures of nonstationary sources, IEEE, 2001, S. 1837–1848
[SBA10] Souden, Mehrez ; Benesty, Jacob ; Affes, Sofiene: On optimal
frequency-domain multichannel linear filtering for noise reduction. In: Au-
dio, Speech, and Language Processing, IEEE Transactions on 18 (2010),
Nr. 2, S. 260–276
[SFM+14] Smaragdis, Paris ; Fevotte, Cedric ; Mysore, Gautham J. ; Mo-
hammadiha, Nasser ; Hoffman, Matthias: Static and dynamic source
separation using nonnegative factorizations: A unified view. In: Signal
Processing Magazine, IEEE 31 (2014), Nr. 3, S. 66–75
[SLOVB14] Souviraa-Labastie, Nathan ; Olivero, Anaik ; Vincent, Emmanuel
; Bimbot, Frederic: Audio source separation using multiple deformed ref-
erences. In: Signal Processing Conference (EUSIPCO), 2014 Proceedings
of the 22nd European IEEE, 2014, S. 311–315
[SLVB15] Souviraa-Labastie, Nathan ; Vincent, Emmanuel ; Bimbot,
Frederic: Music separation guided by cover tracks: designing the joint
NMF model. In: 40th IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) 2015. Brisbane, Australia, April 2015
[Sma07] Smaragdis, Paris: Convolutive Speech Bases and Their Ap-
plication to Supervised Speech Separation. In: Trans. Au-
dio, Speech and Lang. Proc. 15 (2007), Januar, Nr. 1, 1–
12. http://dx.doi.org/10.1109/TASL.2006.876726. – DOI
10.1109/TASL.2006.876726. – ISSN 1558–7916
[SMW04] Spriet, Ann ; Moonen, Marc ; Wouters, Jan: Spatially pre-processed
speech distortion weighted multi-channel Wiener filtering for noise reduc-
tion. In: Signal Processing 84 (2004), Nr. 12, S. 2367–2387
[TKWB11] Thurau, Christian ; Kersting, Kristian ; Wahabzada, Mirwaes ;
Bauckhage, Christian: Convex non-negative matrix factorization for
massive datasets. In: Knowledge and information systems 29 (2011), Nr.
2, S. 457–478
[VAT+12] Vincent, Emmanuel ; Araki, Shoko ; Theis, Fabian ; Nolte, Guido ;
Bofill, Pau ; Sawada, Hiroshi ; Ozerov, Alexey ; Gowreesunker,
Vikrham ; Lutter, Dominik ; Duong, Ngoc Q.: The signal separa-
tion evaluation campaign (2007–2010): Achievements and remaining chal-
lenges. In: Signal Processing 92 (2012), Nr. 8, S. 1928–1936
[VBGB14] Vincent, Emmanuel ; Bertin, Nancy ; Gribonval, Remi ; Bimbot,
Frederic: From blind to guided audio source separation: How models and
side information can improve the separation of sound. In: IEEE Signal
Processing Magazine 31 (2014), Nr. 3, S. 107–115
[VG+08] Virtanen, Tuomas ; Godsill, Simon u. a.: Bayesian extensions to non-
negative matrix factorisation for audio signal modelling. In: Acoustics,
Speech and Signal Processing, 2008. ICASSP 2008. IEEE International
Conference on IEEE, 2008, S. 1825–1828
[VGA09] Vasiloglou, Nikolaos ; Gray, Alexander G. ; Anderson, David V.:
Non-negative Matrix Factorization, Convexity and Isometry. In: SDM
SIAM, 2009, S. 673–684
[Vir07] Virtanen, Tuomas: Monaural sound source separation by nonnegative
matrix factorization with temporal continuity and sparseness criteria. In:
IEEE Trans. On Audio, Speech and Lang. Processing (2007)
[VJA+10] Vincent, Emmanuel ; Jafari, Maria G. ; Abdallah, Samer A.
; Plumbley, Mark D. ; Davies, Mike E.: Probabilistic modeling
paradigms for audio source separation, Hershey: IGI Global, 2010, S. 162–
185
[VZ12] Van Zaen, Jerome: Efficient schemes for adaptive frequency tracking
and their relevance for EEG and ECG. (2012)
[WJ04] Wang, Yuan ; Jia, Yunde: Fisher non-negative matrix factorization for
learning local features. In: In Proc. Asian Conf. on Comp. Vision Citeseer,
2004
[YXZ+07] Yan, Shuicheng ; Xu, Dong ; Zhang, Benyu ; Zhang, Hong-Jiang ;
Yang, Qiang ; Lin, Stephen: Graph embedding and extensions: a general
framework for dimensionality reduction. In: Pattern Analysis and Machine
Intelligence, IEEE Transactions on (2007), Nr. 1, S. 40–51
[YXZZ05] Yan, Shuicheng ; Xu, Dong ; Zhang, Benyu ; Zhang, Hong-Jiang:
Graph embedding: A general framework for dimensionality reduction. In:
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Com-
puter Society Conference on IEEE, 2005, S. 830–837
[YYF+08] Yang, Jianchao ; Yang, Shuicheng ; Fu, Yun ; Li, Xuelong ; Huang,
Thomas: Non-negative graph embedding. In: Computer Vision and Pat-
tern Recognition, 2008. CVPR 2008. IEEE Conference on IEEE, 2008, S.
1–8
[Zhu05] Zhu, Xiaojin: Semi-Supervised Learning Literature Survey / Computer
Sciences, University of Wisconsin -Madison. 2005 (1530). – Forschungs-
bericht