HST582J/6.555J/16.456J Biomedical Signal and Image Processing Spring 2005
Singular Value Decomposition &
Independent Component Analysis
G.D. Clifford
March 11, 2005
Introduction
In this chapter we will examine how we can generalize the idea of transforming a time
series in an alternative representation, such as the Fourier (frequency) domain, to facili-
tate systematic methods of either removing (filtering) or adding (interpolating) data. In
particular, we will examine the techniques of Principal Component Analysis (PCA) using
Singular Value Decomposition (SVD), and Independent Component Analysis (ICA). Both of these techniques utilize a representation of the data in a statistical domain rather than a time or frequency domain. That is, the data are projected onto a new set of axes that fulfill some statistical criterion, which implies independence, rather than onto a set of axes that represent discrete frequencies, as with the Fourier transform.
Another important difference between these statistical techniques and Fourier-based tech-
niques is that the Fourier components onto which a data segment is projected are fixed,
whereas PCA- or ICA-based transformations depend on the structure of the data being ana-
lyzed. The axes onto which the data are projected are therefore discovered. If the structure
of the data changes over time, then the subspace onto which the data is projected will
change too.
Any projection onto another space is essentially a method for separating the data out into
separate sources which will hopefully allow us to see important structure in a particu-
lar projection. For example, by calculating the power spectrum of a segment of data, we
hope to see peaks at certain frequencies. The power (amplitude squared) along certain
frequency vectors is therefore high, meaning we have a strong component in the signal
at that frequency. By discarding the projections that correspond to the unwanted sources
(such as the noise or artifact sources) and inverting the transformation, we effectively per-
form a filtering of the signal. This is true for both ICA and PCA as well as Fourier-based
1
-
techniques. However, one important difference between these techniques is that Fourier
techniques assume that the projections onto each frequency component are independent
of the other frequency components. In PCA and ICA we attempt to find a set of axes which
are independent of one another in some sense. We assume there are a set of independent
sources in the data, but do not assume their exact properties. (Therefore, they may overlap
in the frequency domain in contrast to Fourier techniques.) We then define some measure
of independence and attempt to decorrelate the data by maximizing this measure. Since
we discover, rather than define, the axes, this process is known as blind source separation. For PCA the measure we use to discover the axes is variance, which leads to a set of orthogonal axes. For ICA this measure is based on non-Gaussianity (such as kurtosis), and the axes are not necessarily orthogonal. Kurtosis is the fourth moment (mean, variance,
and skewness are the first three) and is a measure of how non-Gaussian a distribution
is. Our assumption is that if we maximize the non-Gaussianity of a set of signals, then
they are maximally independent. (This comes from the central limit theorem; if we keep
adding independent signals together, we will eventually arrive at a Gaussian distribution.)
If we break a Gaussian-like observation down into a set of non-Gaussian mixtures, each
with distributions that are as non-Gaussian as possible, the individual signals will be in-
dependent. Therefore, kurtosis allows us to separate non-Gaussian independent sources,
whereas variance allows us to separate independent Gaussian noise sources.
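This central-limit-theorem intuition is easy to check numerically. The sketch below (Python/NumPy; the choice of signals and the random seed are mine, not from the notes) estimates the sample kurtosis of a sub-Gaussian sinusoid, a super-Gaussian Laplacian signal, and their sum; the sum is measurably closer to Gaussian (kurtosis nearer zero) than the more extreme component:

```python
import numpy as np

rng = np.random.default_rng(0)

def kurtosis(x):
    """Excess kurtosis: fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3.0

# Two independent non-Gaussian sources: a sinusoid (sub-Gaussian) and a
# Laplacian-distributed signal (super-Gaussian).
n = 100_000
s1 = np.sin(2 * np.pi * np.arange(n) / 97.0)   # excess kurtosis near -1.5
s2 = rng.laplace(size=n)                       # excess kurtosis near +3

print(kurtosis(s1))        # strongly negative (sub-Gaussian)
print(kurtosis(s2))        # strongly positive (super-Gaussian)
print(kurtosis(s1 + s2))   # the mixture is more Gaussian than s2 alone
```

Adding the two independent sources pulls the distribution toward a Gaussian, which is exactly why maximizing non-Gaussianity can undo a mixing.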
This simple idea, if formulated in the correct manner, can lead to some surprising results,
as you will discover in the applications section later in these notes and in the accompa-
nying laboratory. However, we shall first map out the mathematical structure required to
understand how these independent subspaces are discovered and what this means about
our data. We shall also examine the assumptions we must make and what happens when
these assumptions break down.
12.1 Signal & noise separation
In general, an observed (recorded) time series comprises both the signal we wish to analyze and a noise component that we would like to remove. Noise or artifact removal often comprises a data reduction step (filtering) followed by a data reconstruction technique
(such as interpolation). However, the success of the data reduction and reconstruction
steps is highly dependent upon the nature of the noise and the signal.
By definition, noise is the part of the observation that masks the underlying signal we wish
to analyze1, and in itself adds no information to the analysis. However, for a noise signal
to carry no information, it must be white with a flat spectrum and an autocorrelation
function (ACF) equal to an impulse2. Most real noise is not really white, but colored
in some respect. In fact, the term noise is often used rather loosely and is frequently
used to describe signal contamination. For example, muscular activity recorded on the
1 It lowers the SNR!
2 Therefore, no one-step prediction is possible. This type of noise can be generated in MATLAB with the randn function.
electrocardiogram (ECG) is usually thought of as noise or artifact. However, increased
muscle artifact on the ECG actually tells us that the subject is more active than when little
or no muscle noise is present. Muscle noise is therefore a source of information about
activity, although it reduces the amount of information about the cardiac cycle. Signal
and noise definitions are therefore task-related and change depending on the nature of the
information you wish to extract from your observations.
In order to distinguish between random noise and such information-carrying contaminants
we shall refer to these as artifacts. While noise is stochastic in nature, with the same char-
acteristic spectrum over a wide range of frequencies, artifacts tend to be more deterministic
in nature and confined to a specific band of the spectrum.
Table 1 illustrates the range of signal contaminants for the ECG3. We shall also exam-
ine the statistical qualities of these contaminants in terms of their probability distribution
functions (PDFs) since the power spectrum of a signal is not always sufficient to charac-
terize a signal. The shape of a PDF can be described in terms of its Gaussianity, or rather,
departures from this idealized form (which are therefore called super- or sub-Gaussian).
The fact that these signals are not Gaussian turns out to be an extremely important quality, which is closely connected to the concept of independence, and which we shall exploit to separate contaminants from the signal.
Although noise is often modeled as linear Gaussian white noise4, this is often not the case.
Noise is often correlated (with itself or sometimes the signal), or concentrated at certain
values. For example, 50Hz or 60Hz mains noise contamination is sinusoidal, a waveform
that spends most of its time at the extreme values (near its turning points). By considering
departures from the ideal Gaussian noise model we will see how conventional techniques
can under-perform and how more sophisticated (statistical-based) techniques can provide
improved filtering.
We will now explore how this is simply another form of data reduction (or filtering)
through projection onto a new set of basis functions followed by data reconstruction
through projection back into the original space. By reducing the number of basis func-
tions we filter the data (by discarding the projections onto axes that are believed to cor-
respond to noise). By projecting from a reduced set of basis functions (onto which the
data has been compressed) back to the original space, we perform a type of interpolation
(by adding information from a model that encodes some of our prior beliefs about the
underlying nature of the signal or is derived from a data set).
3 Throughout this chapter we shall use the ECG as a descriptive example because it has easily recognizable (and definable) features and contaminants.
4 Generated in MATLAB by the function randn.
Figure 1: 10 seconds of 3-channel ECG. Note the high-amplitude movement artifact in the first two seconds. Note also the QRS-like artifacts around 2.6 and 5.1 seconds.
Contaminant                Frequency Range              Time duration
50 or 60 Hz Powerline      Narrowband (50 or 60 Hz)     Continuous
Movement Baseline Wander   Narrowband (low frequency)   Transient or Continuous
Muscle Noise               Broadband                    Transient
Electrical Interference    Narrowband                   Transient or Continuous
Electrode pop              Narrowband                   Transient
Observation noise          Broadband                    Continuous
Quantization noise         Broadband                    Continuous

Table 1: Contaminants on the ECG and their nature.
12.2 Matrix transformations as filters
The simplest filtering of a time series involves the transformation of a one-dimensional ($m = 1$) time series of $N$ points, $x = (x_1, x_2, \dots, x_N)^T$, into a new representation, $y = (y_1, y_2, \dots, y_N)^T$. If $x$ is a column vector5 that represents a channel of ECG, then we can generalize this representation so that $m$ channels of ECG, $X$, and their transformed representation $Y$, are given by

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nm} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nm} \end{pmatrix}. \eqno(1)$$

Note that $X$ and $Y$ are $N \times m$ matrices (i.e. the signal is $m$-dimensional, with $N$ samples for each vector). A transformation matrix $W$ can then be applied to $X$ to create the transformed matrix $Y$ such that

$$Y = WX. \eqno(2)$$

The purpose of a transformation is to highlight patterns in the data, identify interesting
projections and (if we are filtering) discard the uninteresting parts of the signal (which
are masking the information we are interested in). In general, transforms can be cate-
gorized as orthogonal or biorthogonal transforms. For orthogonal transformations, the transformed signal is the same length as the original, and the energy of the data is unchanged.
An example of this is the Fourier transform where the same signal is measured along a
new set of perpendicular axes corresponding to the coefficients of the Fourier series6.
For biorthogonal transforms, lengths and angles may change and the new axes are not
necessarily perpendicular. However, no information is lost and perfect reconstruction of
the original signal is still possible (using $X = W^{-1}Y$).

Transformations can be further categorized as either lossless (so that the transformation can be reversed and the original data restored exactly) or lossy. When a signal is filtered or compressed (through downsampling, for instance), information is often lost and
the transformation is not invertible. In general, lossy transformations involve a mapping
of the data into a representation using a transformation matrix that contains some zero
entries, and therefore there is an irreversible removal of some of the data. Often this
corresponds to a mapping to a lower number of dimensions.
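The lossless/lossy distinction can be made concrete. In the sketch below (Python/NumPy; the random orthonormal basis is an arbitrary stand-in for a transform such as the Fourier basis), an orthogonal transformation preserves energy and is perfectly invertible, while zeroing transform coefficients makes the overall mapping singular and irreversible:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)                  # a short "signal"

# Build an orthonormal transformation matrix W (here via QR decomposition
# of a random matrix; any orthonormal basis would do).
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))

y = W @ x                                   # project onto the new axes
print(np.isclose(y @ y, x @ x))             # energy unchanged (orthogonal)

x_rec = W.T @ y                             # W^{-1} = W^T: lossless inverse
print(np.allclose(x_rec, x))

# Lossy version: discard (zero) half of the transformed coefficients
# before inverting -- the information on the discarded axes is gone.
mask = np.array([1.0] * 4 + [0.0] * 4)
x_lossy = W.T @ (mask * y)
print(np.allclose(x_lossy, x))              # no longer exact in general
```

Discarding coefficients is exactly the "mapping to a lower number of dimensions" described above.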
In the following sections we will study two transformation techniques Singular Value De-
composition (SVD), and Independent Component Analysis (ICA). We shall see how SVD
imposes an orthogonal transform, whereas ICA results in a biorthogonal transform on the
5 In MATLAB, the command size(x) applied to such a vector gives a dimension of $N \times 1$, and length(x) gives a length equal to $N$.
6 There is also a class of wavelets, known as orthogonal wavelets, that falls into the same category.
data. Both techniques can be used to perform lossy or lossless transformations. SVD and ICA both involve projecting the data onto a set of axes which are determined by the nature of the data, and are therefore methods of blind source separation (BSS). (Blind because the axes of projection, and therefore the sources, are determined
through the application of an internal measure and without the use of any prior knowledge
of the data structure.)
However, once we have discovered the basis functions of the independent axes in the data
and have separated them out by projecting the data onto these axes, we can then use these
techniques to filter the data. By setting to zero those columns of the SVD and ICA separation matrices that correspond to unwanted sources, we produce noninvertible matrices7. If we then
force the inversion of the separation matrix8 and transform the data back into the original
observation space, we can remove the unwanted source from the original signal.
In the Fourier transform, the basis functions for the new representation are predefined and
assumed to be independent, whereas in the SVD the representation the basis functions are
found in the data by looking for a set of basis functions that are independent. That is,
the data undergoes a decorrelation using variance as the metric (and therefore the basis
functions are independent in a second order sense).
The basic idea in the application of PCA to a data set is to find the components $y_1, y_2, \dots, y_m$ so that they explain the maximum amount of variance possible with $m$ linearly transformed components. PCA can be defined in an intuitive way using a recursive formulation. The direction of the first principal component $w_1$ is found by passing over the data and attempting to maximize the variance of the projection,

$$w_1 = \arg\max_{\|w\|=1} E\{(w^T X)^2\},$$

where $w$ is the same length $N$ as the data vector $X$. Thus the first principal component is the projection onto the direction in which the variance of the projection is maximized. Each of the remaining $m-1$ principal components is found by repeating this process on the remaining orthogonal subspace (which reduces in dimensionality by one for each new component we discover). The principal components are then given by $y_i = w_i^T X$, the projection of $X$ onto each $w_i$.

The basic goal in PCA is to decorrelate the signal by projecting the data onto orthogonal axes. Furthermore, we usually reduce the dimension of the data from $m$ to $p$ ($p < m$). It can be shown that the PCA representation is an optimal linear dimension reduction
technique in the mean-square sense [1]. One important application of this technique is
for noise reduction, where the data not contained in the first $p$ components is assumed to be mostly due to noise. Another benefit of this technique is that a projection into a subspace
of a very low dimension, for example two or three, is useful for visualizing the data.
In practice, the computation of the $w_i$ can be simply accomplished using the sample covariance matrix $C = E\{XX^T\}$. The $w_i$ are the eigenvectors of $C$ that correspond to the $p$ largest eigenvalues of $C$. A method for determining the eigenvalues in this manner

7 For example, the transformation matrix [1 0; 0 0] is noninvertible, or singular (inv([1 0; 0 0]) = [Inf Inf; Inf Inf] in MATLAB), and multiplying a two-dimensional signal by this matrix performs a simple reduction of the data by one dimension.
8 Using a pseudo-inversion technique such as MATLAB's pinv().
is known as Singular Value Decomposition (SVD), which is described below. SVD is also
known as the (discrete) Karhunen-Loève transform, or the Hotelling transform (see Ap-
pendix 12.8.1).
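The covariance-eigenvector route to the $w_i$ described above can be sketched in a few lines (Python/NumPy; the toy three-channel data and variable names are mine). The eigenvectors of the sample covariance matrix give the principal directions, and projecting onto them decorrelates the channels:

```python
import numpy as np

rng = np.random.default_rng(2)

# m = 3 channels, N = 1000 samples; the first two channels share a source,
# so the data are strongly correlated.
N = 1000
z = rng.standard_normal(N)
X = np.vstack([z + 0.1 * rng.standard_normal(N),
               2.0 * z + 0.1 * rng.standard_normal(N),
               rng.standard_normal(N)])           # shape (m, N)

Xc = X - X.mean(axis=1, keepdims=True)            # remove channel means
C = (Xc @ Xc.T) / N                               # sample covariance matrix
evals, evecs = np.linalg.eigh(C)                  # ascending eigenvalues
order = np.argsort(evals)[::-1]                   # re-sort, descending
evals, W = evals[order], evecs[:, order]          # columns of W are the w_i

Y = W.T @ Xc                                      # the principal components
cov_Y = (Y @ Y.T) / N                             # diagonal: decorrelated
print(np.round(cov_Y, 6))
```

The off-diagonal entries of `cov_Y` vanish (to machine precision), and the diagonal entries are the eigenvalues: the variance explained along each principal axis, in descending order.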
12.2.1 Method of SVD
To determine the principal components of a multi-dimensional signal, we can use the
method of Singular Value Decomposition. Consider a real $N \times m$ matrix $X$ of observations, which may be decomposed as follows:

$$X = USV^T \eqno(3)$$

where $S$ is a non-square ($N \times m$) matrix with zero entries everywhere except on the leading diagonal, whose elements $s_i$ are arranged in descending order of magnitude. Each $s_i = \sqrt{\lambda_i}$ is the square root of an eigenvalue of $C = X^T X$. A stem-plot of these values against their index $i$ is known as the singular spectrum. The smaller an eigenvalue is, the less energy there is along the corresponding eigenvector. Therefore, the smallest eigenvalues are often considered to be due to noise. The columns of $V$ are the eigenvectors of $C$, and the matrix $US$ is the matrix of projections of $X$ onto the eigenvectors of $C$ [2]. If a truncated SVD of $X$ is performed (i.e. we retain only the $p$ most significant eigenvectors)9, then the truncated SVD is given by $Y = US_pV^T$, and the columns of the $N \times m$ matrix $Y$ are the noise-reduced signal (see Figure 2).
A routine for performing SVD is as follows:

1. Find the non-zero eigenvalues, $\lambda_i$, of the matrix $C = X^T X$ and arrange them in descending order.

2. Find the orthogonal eigenvectors of the matrix $X^T X$ corresponding to the obtained eigenvalues, and arrange them in the same order to form the column vectors of the matrix $V$.

3. Form a diagonal matrix $S$ by placing the square roots $s_i = \sqrt{\lambda_i}$ of the non-zero eigenvalues found in step (1) in descending order on the leading diagonal.

4. Find the first $r$ column vectors of the matrix $U$: $u_i = s_i^{-1} X v_i$ $(i = 1, \dots, r)$, where $r$ is the number of non-zero eigenvalues.

5. Add to the matrix $U$ the remaining $N - r$ vectors using the Gram-Schmidt orthogonalization process (see appendix 12.8.2).

9 In practice, choosing the value of $p$ depends on the nature of the data, but it is often taken to be the knee in the eigenspectrum (see 12.2.3) or as the value where $\sum_{i=1}^{p} s_i \geq \alpha \sum_{i=1}^{m} s_i$ and $\alpha$ is some fraction close to unity.
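The five steps above can be sketched directly in code (Python/NumPy; this is an illustration of the routine, not an implementation one would use in practice, since library SVD routines are more accurate and efficient):

```python
import numpy as np

def svd_via_eig(X):
    """SVD of an N x m matrix X (N >= m) following steps 1-5 above."""
    N, m = X.shape
    # Steps 1-2: eigenvalues/eigenvectors of X^T X, in descending order.
    evals, V = np.linalg.eigh(X.T @ X)
    order = np.argsort(evals)[::-1]
    evals, V = evals[order], V[:, order]
    # Step 3: singular values are the square roots of the eigenvalues.
    s = np.sqrt(np.clip(evals, 0.0, None))
    S = np.zeros((N, m))
    S[:m, :m] = np.diag(s)
    # Step 4: first r columns of U via u_i = X v_i / s_i.
    r = int(np.sum(s > 1e-12))
    U_cols = [X @ V[:, i] / s[i] for i in range(r)]
    # Step 5: extend to an orthonormal basis of R^N by Gram-Schmidt
    # against the standard basis vectors.
    for e in np.eye(N):
        w = e - sum((c @ e) * c for c in U_cols)
        if np.linalg.norm(w) > 1e-10:
            U_cols.append(w / np.linalg.norm(w))
        if len(U_cols) == N:
            break
    return np.column_stack(U_cols), S, V

X = np.random.default_rng(0).standard_normal((6, 4))
U, S, V = svd_via_eig(X)
print(np.allclose(U @ S @ V.T, X))      # reconstruction holds
print(np.allclose(U.T @ U, np.eye(6)))  # U is orthonormal
```

Note that steps 1-4 only determine $U$ up to its first $r$ columns; the Gram-Schmidt completion in step 5 is arbitrary, which is harmless because those columns multiply zero rows of $S$.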
12.2.2 Eigenvalue decomposition - a worked example
To find the singular value decomposition of the matrix

$$X = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \eqno(4)$$

first we find the eigenvalues, $\lambda$, of the matrix

$$C = X^T X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

in the usual manner, by letting

$$Cv = \lambda v \eqno(5)$$

so $(C - \lambda I)v = 0$ and

$$\begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = 0.$$

Evaluating this determinant and solving the resulting characteristic equation, $(2-\lambda)^2 - 1 = 0$, we find $\lambda_1 = 3$ and $\lambda_2 = 1$. Next we note the number of nonzero eigenvalues, $r$, of the matrix $X^T X$: $r = 2$. Then we find the orthonormal eigenvectors of $C$ corresponding to the eigenvalues $\lambda_1$ and $\lambda_2$ by solving $(C - \lambda_i I)v_i = 0$, giving

$$v_1 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix}, \qquad v_2 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} \end{pmatrix} \eqno(6)$$

forming the matrix

$$V = (v_1\ v_2) = \begin{pmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{pmatrix} \eqno(7)$$

where $v_1$ and $v_2$ are normalized to unit length. Next we write down the singular value matrix $S$, which is a diagonal matrix composed of the square roots of the eigenvalues of $C = X^T X$ arranged in descending order of magnitude:

$$S = \begin{pmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} \eqno(8)$$

Next we find the first two columns of $U$ (noting that $s_1^{-1} = \tfrac{1}{\sqrt{3}}$ and $s_2^{-1} = 1$) using the relation $u_i = s_i^{-1} X v_i$:

$$u_1 = \frac{1}{\sqrt{3}} \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} \tfrac{2}{\sqrt{6}} \\ \tfrac{1}{\sqrt{6}} \\ \tfrac{1}{\sqrt{6}} \end{pmatrix}$$

and

$$u_2 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 0 \\ -\tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix}.$$

Using the Gram-Schmidt process (see appendix 12.8.2) we can calculate the third and remaining orthogonal column of $U$ to be

$$u_3 = \begin{pmatrix} -\tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{3}} \end{pmatrix}.$$

Hence

$$U = (u_1\ u_2\ u_3) = \begin{pmatrix} \tfrac{2}{\sqrt{6}} & 0 & -\tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{6}} & -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{6}} & \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{3}} \end{pmatrix}$$

and the singular value decomposition of the matrix $X$ is

$$X = USV^T = \begin{pmatrix} \tfrac{2}{\sqrt{6}} & 0 & -\tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{6}} & -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{3}} \\ \tfrac{1}{\sqrt{6}} & \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{3}} \end{pmatrix} \begin{pmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{pmatrix}.$$
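This worked example can be sanity-checked numerically (Python/NumPy; note that a library routine may return singular vectors with flipped signs, which is an equally valid decomposition):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

U, s, Vt = np.linalg.svd(X)
print(s)                              # [sqrt(3), 1] = [1.7320..., 1.0]

# Reassemble X = U S V^T with S the 3x2 matrix of Eq. (8).
S = np.zeros((3, 2))
S[:2, :2] = np.diag(s)
print(np.allclose(U @ S @ Vt, X))     # True
```

The returned singular values match $\sqrt{3}$ and $1$, and the product $USV^T$ reproduces $X$ to machine precision.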
12.2.3 SVD filtering - a practical example using the ECG
We will now look at a more practical (and complicated) illustration. SVD is a commonly
employed technique to compress and/or filter the ECG. In particular, if we align $m$ heartbeats, each $N$ samples long, in a matrix (of size $N \times m$), we can compress the matrix down (into an $N \times p$ matrix) using only the first $p < m$ principal components. If we then reconstruct the data by inverting the reduced-rank matrix, we effectively filter the original data.

Fig. 2a is a set of eight heart beat waveforms which have been cut into one-second segments (200 Hz data), aligned by their R-peaks and placed side-by-side to form a matrix. The data is therefore 8-dimensional and an SVD will lead to 8 eigenvectors. Fig. 2b
is the eigenspectrum after SVD10. Note that most of the power is contained in the first
eigenvector. The knee of the eigenspectrum is at the second principal component. Fig. 2c
is a plot of the reconstruction (filtering) of the data using just the first eigenvector11. Fig.
2d is the same as Fig. 2c, but the first two eigenvectors have been used to reconstruct the
data. The data in Fig. 2d is therefore noisier than that in Fig. 2c.
Note that $S$ derived from a full SVD (using MATLAB's svd()) is an invertible matrix, and no information is lost if we retain all the principal components. However, if we perform a

10 [U,S,V]=svd(data); stem(diag(S))
11 [U,S,V]=svds(data,1); mesh(U*S*V')
Figure 2: SVD of eight R-peak aligned P-QRS-T complexes; a) in the original form with a
large amount of in-band noise, b) eigenspectrum of decomposition, c) reconstruction using
only the first principal component, d) reconstruction using only the first two principal
components.
truncated SVD (using svds()), then the inverse (inv(S)) does not exist. The transformation that effects the filtering is noninvertible, and information is lost because $S$ is singular.
From a compression point of view, SVD is an excellent tool. In the above example we have taken 8 heartbeats and compressed them down to 24 numbers: the 8 eigenvalues along the first eigenvector for each beat, the 8-dimensional eigenvector, and the eight time stamps that represent the fiducial point of each beat, i.e. a compression ratio of 256:24 = 32:3. If the orthogonal subspace is the same for every group of beats, we can simply
take any new set of heartbeats and project them onto the first principal component we
have derived here. However, it should be noted that the eigenvectors are likely to change,
based upon the noise in the signal. Therefore, the correct way to do this is to take a large,
representative set of heartbeats and perform SVD upon this [3]. Projecting each new beat
onto these globally derived basis functions leads to a filtering of the signal that is essentially
equivalent to passing the P-QRS-T complex through a set of trained weights of a multi-layer
perceptron (MLP) neural network [4]. Abnormal beats or artifacts erroneously detected
as normal beats will have abnormal eigenvalues (or a highly irregular structure when
reconstructed by the MLP). In this way, beat classification can be performed. It should
be noted however, that in order to retain all the subtleties of the QRS complex, at least 5
eigenvectors are required (and another five for the rest of the beat). The achievable compression ratio also depends on the sampling rate and the average beat-to-beat (RR) interval: it can be written as the product of two quotients. The first rescales the compression ratio by the average RR interval, since increases in heart rate (and hence decreases in the average RR interval) lead to a reduction in the compression factor. The second quotient is the ratio of the number of principal components considered noise to the number used to encode the signal.
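The filtering described in this section can be reproduced on synthetic data (Python/NumPy; the Gaussian-bump "beats", matrix dimensions, and noise level below are stand-ins for the real ECG matrix, which is not available here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for Fig. 2a: 8 R-peak aligned "beats" of 200 samples each,
# i.e. the columns of a 200 x 8 matrix, sharing one underlying waveform.
t = np.linspace(0.0, 1.0, 200)
clean = np.exp(-((t - 0.5) / 0.05) ** 2)           # crude QRS-like bump
data = clean[:, None] + 0.2 * rng.standard_normal((200, 8))

# Truncated SVD: reconstruct from the first p principal components only.
U, s, Vt = np.linalg.svd(data, full_matrices=False)
p = 1
filtered = U[:, :p] @ np.diag(s[:p]) @ Vt[:p, :]

err_raw = np.linalg.norm(data - clean[:, None])
err_filtered = np.linalg.norm(filtered - clean[:, None])
print(err_raw, err_filtered)   # rank-1 reconstruction is less noisy
```

Because the common waveform dominates the first singular vector, the rank-1 reconstruction discards most of the uncorrelated in-band noise, just as in Fig. 2c.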
12.2.4 Signal separation - a general approach
Using SVD we have seen how we can separate a signal into a subspace that is signal and
a subspace that is essentially noise. This is done by assuming that only the eigenvectors
associated with the $p$ largest eigenvalues represent the signal, and the remaining $m - p$ eigenvalues are associated with the noise subspace. We try to maximize the independence
between the eigenvectors that span these subspaces by requiring them to be orthogonal.
However, the differences between signals and noise are not always clear, and orthogonal
subspaces may not be the best way to differentiate between the constituent sources in a
measured signal.
The term noise generally means the stochastic or non-periodic12 part of the signal we wish
to remove in order to analyze the periodic contribution to the signal. However, this term
can be misleading, and sometimes the noise is information carrying (if it is not white),
transient and can even be deterministic.
12There are no spikes in the power spectrum for this section of the signal. Periodic noise (sometimes
called artifact) is often in-band or non-stationary and therefore cannot be removed by traditional filtering
techniques; see the sections on ICA for further discussion of this.
So far we have looked at techniques that deal well with linear, stationary, white, Gaus-
sian noise. However, since many signals are nonlinear, nonstationary, colored or non-Gaussian, we should consider how these techniques deal with such signals and use al-
ternative methods when they fail. In the next section we will consider all the sources in
a set of observations to be signals (or source components), rather than the conventional
signal/noise paradigm and use a higher order moment-based statistical technique for ex-
tracting out each source (one or more of which may represent what we conventionally
think of as noise).
12.3 Blind Source Separation; the Cocktail Party Problem
The Cocktail Party Problem is the often-quoted illustration of Blind Source Separation (BSS): the separation of a set of observations into the constituent underlying (statistically independent) source signals. The Cocktail Party Problem is illustrated in Fig. 3. If each of the $m$ voices you can hear at a party is recorded by $m$ microphones, the recordings will be a matrix composed of a set of $m$ vectors, each of which is a (weighted) linear superposition of the $m$ voices. For a discrete set of $N$ samples, we can denote the sources by an $m \times N$ matrix, $Z$, and the $m$ recordings by an $m \times N$ matrix $X$. $Z$ is therefore transformed into the observables $X$ (through the propagation of sound waves through the room) by multiplying it by an $m \times m$ mixing matrix $A$, such that $X = AZ$.

In order for us to 'pick out' a voice from an ensemble of voices in a crowded room, we must perform some type of BSS to recover the original sources from the observed mixture. Mathematically, we want to find a demixing matrix $W$ which, when multiplied by the recordings $X$, produces an estimate $Y$ of the sources $Z$. Therefore $W$ is a set of weights (approximately13) equal to $A^{-1}$. One of the key methods for performing BSS is known as Independent Component Analysis (ICA), where we take advantage of the linear independence between the sources.

An excellent interactive example of the cocktail party problem can be found online. The reader is encouraged to experiment with this demonstration at this stage. Initially you should
attempt to mix and separate just two different sources, then increase the complexity of the
problem adding more sources. Note that the relative phases of the sources differ slightly
for each recording (microphone) and that the separation of the sources may change in
order and volume. The reason for this is explained in the following sections.
13 Depending on the performance details of the algorithm used to calculate $W$.
Figure 3: The Cocktail Party Problem: sound waves from $m$ independent speakers ($z_1$, $z_2$ and $z_3$, left) are superimposed at a cocktail party (center) and are recorded as three mixed source vectors, $x_1$, $x_2$ and $x_3$, on $m$ microphones (right). The recordings $X$ of the sources $Z$ are an $N$-sample linear mixture of the sources, such that $X = AZ$, where $A$ is an $m \times m$ linear mixing matrix. An estimate $Y$ of the sources $Z$ is made by calculating a demixing matrix $W$, which acts on $X$ such that $Y = WX \approx Z$ and $W \approx A^{-1}$.
12.4 Independent Component Analysis
Independent Component Analysis is a general name for a variety of techniques which seek
to uncover the independent source signals from a set of observations that are composed of linear mixtures of the underlying sources. Consider $X$ to be an $m \times N$ matrix of observed random vectors, $A$ an $m \times m$ mixing matrix, and $Z$ the $m \times N$ (assumed) source matrix, such that

$$X = AZ. \eqno(9)$$

ICA algorithms attempt to find a separating or demixing matrix $W$ such that

$$Y = WX \eqno(10)$$

is an estimate of the sources $Z$.
The essential difference between ICA and PCA is that PCA uses variance, a second order
moment, rather than kurtosis, a fourth order moment, as a metric to separate the signal
from the noise. Independence between the projections onto the eigenvectors of an SVD is
imposed by requiring that these basis vectors be orthogonal. The subspace formed with
ICA is therefore not necessarily orthogonal, and the angles between the axes depend upon the exact nature of the data used to calculate the sources.
The fact that SVD imposes orthogonality means that the data has been decorrelated (the
projections onto the eigenvectors have zero covariance). This is a much weaker form of
independence than that imposed by ICA18. Since independence implies uncorrelatedness,
many ICA methods constrain the estimation procedure such that it always gives uncorre-
lated estimates of the independent components. This reduces the number of free parame-
ters, and simplifies the problem.
12.4.1 Gaussianity
We will now look more closely at what the kurtosis of a distribution means, and how this
helps us separate component sources within a signal by imposing independence. The first
two moments of a set of values are well known; the mean and the variance. These are
often sufficient to coarsely characterize a time series. However, they do not uniquely specify the distribution, and one often forgets that the full form of the distribution (usually assumed to be Gaussian) is also required.
The third moment of a distribution is known as the skew and characterizes the degree of
asymmetry about the mean. The skew of a random variable $x$ is given by

$$\mathrm{skew}(x) = \frac{1}{N}\sum_{j=1}^{N}\left(\frac{x_j - \mu}{\sigma}\right)^3 \eqno(12)$$

where $\sigma$ is the standard deviation, $\mu$ is the mean, and $N$ the number of data points. A positive skew signifies a distribution with a tail extending out toward more positive $x$, and a negative skew signifies a distribution with a tail extending out toward more negative $x$ (see Fig. 4a).

The fourth moment of a distribution is known as kurtosis and measures the relative peakedness or flatness of a distribution with respect to a Gaussian (normal) distribution. See Fig.
4b. It is defined by
$$\mathrm{kurt}(x) = \frac{1}{N}\sum_{j=1}^{N}\left(\frac{x_j - \mu}{\sigma}\right)^4 \eqno(13)$$

Note that some conventions add a $-3$ correction factor to the above equation, where the $-3$ term makes the value zero for a normal distribution, i.e. makes a Gaussian have zero

18 Independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence.
Figure 4: Distributions with third and fourth moments [skewness, (a) and kurtosis (b)
respectively] that are significantly different from normal (Gaussian).
kurtosis. This is the convention in Numerical Recipes in C, but MATLAB uses the convention without the $-3$ term, and therefore Gaussian distributions have a kurtosis of 3. We shall adopt the MATLAB convention in these notes.
A distribution with a kurtosis greater than that of a Gaussian is termed leptokurtic (or super-Gaussian). A distribution with a kurtosis less than that of a Gaussian is termed platykurtic (or sub-Gaussian). Gaussian distributions are termed mesokurtic. Note also that, in contrast
to the mean and variance, the skew and kurtosis are dimensionless quantities.
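Eqs. (12) and (13) are straightforward to implement (Python/NumPy sketch; the test signals are mine, and kurt follows the MATLAB convention adopted above, in which a Gaussian has kurtosis 3):

```python
import numpy as np

def skew(x):
    """Third standardized moment, Eq. (12)."""
    mu, sigma = np.mean(x), np.std(x)
    return np.mean(((x - mu) / sigma) ** 3)

def kurt(x):
    """Fourth standardized moment, Eq. (13): 3 for a Gaussian."""
    mu, sigma = np.mean(x), np.std(x)
    return np.mean(((x - mu) / sigma) ** 4)

n = 200_000
rng = np.random.default_rng(5)
g = rng.standard_normal(n)                     # Gaussian (mesokurtic)
u = np.sin(2 * np.pi * np.arange(n) / 100.0)   # sinusoid (platykurtic)
e = rng.exponential(size=n)                    # skewed, leptokurtic

print(skew(g), kurt(g))   # approx 0 and 3
print(skew(u), kurt(u))   # approx 0 and 1.5
print(skew(e), kurt(e))   # approx 2 and 9
```

For comparison, the histogram annotations in Fig. 5 appear to use the zero-for-Gaussian (excess) convention, i.e. these values minus 3.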
Fig. 5 illustrates the time series, power spectra and distributions of different signals and
noises found on the ECG. From left to right: (i) the underlying Electrocardiogram signal,
(ii) additive (Gaussian) observation noise, (iii) a combination of muscle artifact (MA)
and baseline wander (BW), and (iv) powerline interference: sinusoidal noise. Note that all the signals have significant power contributions within the frequency band of interest, where there exists clinically relevant information in the ECG. Traditional filtering methods therefore cannot remove these noises without severely distorting the underlying ECG.
12.5 ICA algorithms

Although the basic idea behind ICA is very simple, the actual implementation can be formulated from many perspectives:

- Maximum likelihood (MacKay [16], Pearlmutter & Parra [17], Cardoso [18], Girolami & Fyfe [19])
[Figure 5 appears here. Each column shows one signal (ECG; Gaussian noise; MA/BW; sinusoidal powerline interference) as three panels: the time series (Signal), the power spectral density (PSD), and the histogram. The measured moments are: ECG, skew = 3, kurt = 8; Gaussian noise, skew = 0, kurt = 0; MA/BW, skew = 1, kurt = 2; sinusoidal, skew = 0, kurt = −1.5.]
Figure 5: Time series, power spectra and distributions of different signals and noises found on the ECG. From left to right: (i) the underlying electrocardiogram signal, (ii) additive (Gaussian) observation noise, (iii) a combination of muscle artifact (MA) and baseline wander (BW), and (iv) powerline interference: sinusoidal noise at the mains frequency.
- Higher order moments and cumulants (Comon [20], Hyvärinen & Oja [21])
- Maximization of information transfer (Bell & Sejnowski [22], Amari et al. [23], Lee et al. [24])
- Negentropy maximization (Girolami & Fyfe [19])
- Nonlinear PCA (Karhunen et al. [25, 26], Oja et al. [27])
Although all these methods use a measure of non-Gaussianity, the actual implementation can involve either a manipulation of the output data, y, or a manipulation of the demixing matrix, W. We will now examine a selection of these methods.

12.5.1 Maximum Likelihood and gradient descent
Independent component analysis can be thought of as a statistical modeling technique that
makes use of latent variables to describe a probability distribution over the observables.
This is known as generative latent variable modeling, and each source is found by deducing its corresponding distribution. Following MacKay [16], we model the N observable vectors {x_j} as being generated from latent variables {s_i}, i = 1, …, N, via a linear mapping A with elements a_{ji} (the transpose of A is written Aᵀ). To simplify the derivation we assume the number of sources equals the number of observations, and the data are then defined to be X = {x(t)}, t = 1, …, T, where T is the number of samples in each of the N observations. The latent variables are assumed to be independently distributed, with marginal distributions p(s_i) = p_i(s_i), where p_i denotes the assumed form of the probability distributions of the latent variables.

Given A and the assumed model H, the probability of the observables x(t) and the hidden variables s(t) is

P(x(t), s(t) | A, H) = P(x(t) | s(t), A, H) \, P(s(t) | H)    (14)

= \prod_j \delta\!\left( x_j(t) - \sum_i a_{ji} s_i(t) \right) \prod_i p_i(s_i(t))    (15)

Note that for simplicity we have assumed that the observations x have been generated without noise19. If we replace the delta function by a (noise) probability distribution over x_j(t) with mean \sum_i a_{ji} s_i(t) and a small standard deviation, the identical algorithm results [16].

To calculate the elements w_{ij} of the demixing matrix W = A⁻¹ we can use the method of gradient descent, which requires a dimensionless objective function L(W) to be optimized.

19This leads to the Bell–Sejnowski algorithm [16, 22].

The
sequential update of the parameters w_{ij} is then computed as

\Delta w_{ij} = \eta \, \frac{\partial L}{\partial w_{ij}}    (16)

where η is the learning rate20. The quantity L(W) is chosen to 'guide' the gradient descent of the elements of W; for ICA (to maximize independence) we choose the log likelihood function

L(W) = \ln P(X \mid W, H) = \sum_{t=1}^{T} \ln P(x(t) \mid W, H)    (17)

which is the natural log of a product of the (independent) factors, each of which is obtained
by marginalizing over the latent variables.
OPTIONAL: (the rest of this subsection is for the avid reader only)
Recalling that for scalars P(x) = \int P(x|s) P(s) \, ds, and adopting a conventional index summation such that a_{ji} s_i \equiv \sum_i a_{ji} s_i, a single factor in the likelihood is given by

P(x(t) \mid W, H) = \int d^N s \; P(x(t) \mid s(t), W, H) \, P(s(t) \mid H)    (18)

= \int d^N s \; \prod_j \delta\!\left( x_j(t) - a_{ji} s_i(t) \right) \prod_i p_i(s_i(t))    (19)

= |\det W| \, \prod_i p_i\!\left( w_{ij} x_j(t) \right)    (20)

which implies

\ln P(x(t) \mid W, H) = \ln |\det W| + \sum_i \ln p_i(w_{ij} x_j)    (22)

We shall then define

a_i \equiv w_{ij} x_j, \qquad z_i \equiv \phi_i(a_i)    (23)

\frac{\partial}{\partial w_{ij}} \ln |\det W| = [W^{-1}]_{ji}    (24)

\frac{\partial a_i}{\partial w_{ij}} = x_j    (25)
20which can be fixed or variable to aid convergence depending on the form of the underlying source
distributions
with φ_i some arbitrary function, z_i = φ_i(a_i) and φ_i(a_i) ≡ d ln p_i(a_i)/da_i. The quantity z_i indicates in which direction a_i needs to change to make the probability of the data greater. Using equations 24 and 25 we can obtain the gradient of the log likelihood:

\frac{\partial}{\partial w_{ij}} \ln P(x \mid W, H) = [W^{-1}]_{ji} + z_i x_j    (27)

If we choose W so as to ascend this gradient, we obtain precisely the learning algorithm of Bell and Sejnowski [22] (\Delta W \propto [W^T]^{-1} + z x^T). Alternatively, taking the derivative with respect to the elements of the mixing matrix A = W⁻¹ yields an equivalent form [16]. A detailed mathematical analysis of gradient descent/ascent and its relationship to neural networks and PCA is given in Appendix 12.8.3. (Treat this as optional reading.)
In practice, the choice of the nonlinearity φ in the update equations for learning W is heavily dependent on the distribution of the underlying sources. For example, if we choose a traditional tanh nonlinearity (φ(a) = −tanh(βa)), with β a constant initially equal to unity, then we are implicitly assuming that the source densities are heavier-tailed distributions than the Gaussian, since φ(a) = d ln p(a)/da implies p(a) ∝ 1/[cosh(βa)]^{1/β}. Varying β reflects our changing beliefs in the underlying source distributions. In the limit of large β, the nonlinearity becomes a step function and p(a) becomes a biexponential distribution (p(a) ∝ e^{−|a|}). As β tends to zero, p(a) approaches the Gaussian distribution.
If we have no nonlinearity (φ(a) ∝ −a), then we are implicitly assuming a Gaussian distribution on the latent variables. However, it is well known [28, 4] that without a non-
linearity, the gradient descent algorithm leads to second order decorrelation. That is, we
perform the same function as SVD. Equivalently, the Gaussian distribution on the latent
variables is invariant under rotation of the latent variables, so there is no information to
enable us to find a preferred alignment of the latent variable space. This is one reason why
conventional ICA is only able to separate non-Gaussian sources.
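To make the learning rule concrete, the following sketch runs the natural-gradient form of the Bell–Sejnowski update, ΔW ∝ (I + z yᵀ)W with z = −tanh(y) (the natural gradient [23] is discussed further below), on a toy mixture of two Laplacian (super-Gaussian) sources. The mixing matrix, learning rate and iteration count are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two super-Gaussian (Laplacian) sources, linearly mixed.
T = 5000
s = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                 # hypothetical mixing matrix
x = A @ s
x -= x.mean(axis=1, keepdims=True)

# Natural-gradient ascent of the log likelihood:
#   dW = eta * (I + z y^T) W,  with z = phi(y) = -tanh(y)
# (the tanh choice implicitly assumes heavier-than-Gaussian tails).
W = np.eye(2)
eta = 0.1
for _ in range(300):
    y = W @ x
    z = -np.tanh(y)
    W += eta * (np.eye(2) + (z @ y.T) / T) @ W

# If separation succeeded, W A approximates a scaled permutation matrix:
# each row of |W A| is dominated by a single entry.
P = np.abs(W @ A)
print(np.round(P / P.max(axis=1, keepdims=True), 3))
```

With Gaussian sources in place of the Laplacian ones, the same update would only decorrelate, as the text explains.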
It should be noted that the principle of covariance (consistent algorithms should give
the same results independently of the units in which the quantities are measured) is not
always true. One example is the popular steepest descent rule (see Eq. 16) which is
dimensionally inconsistent: the left hand side has dimensions of [w_{ij}] and the right hand side has dimensions of 1/[w_{ij}] (since L(W) is dimensionless). One method for alleviating this problem is to precondition the input data (scaling it between ±1). Another method is to decrease η at a rate 1/t, where t is the number of iterations through the backpropagation of the updates of the w_{ij}. The Robbins–Monro theorem ([29] p. 41) shows that the parameters will asymptotically converge to the maximum likelihood parameters, since each data point receives equal weighting. If η is held constant, then one is explicitly solving a weighted maximum likelihood problem with an exponential weighting of the data, and the parameters will not converge to a limiting value.
The algorithm would be covariant if \Delta w_{ij} = \eta \sum_{i'j'} M_{ij,i'j'} \, \partial L / \partial w_{i'j'}, where M is a curvature matrix with the (ij, i'j') element having dimensions [w_{ij} w_{i'j'}]. It should be noted that the differential of the metric for the gradient descent is not linear (as it is for a least squares computation), and so the space on which we perform gradient descent is not Euclidean. In fact, one must use the natural [23] or relative [30] gradient. See [16] and [23] for further discussion on this topic.
12.5.2 Negentropy as a cost function
Although kurtosis is theoretically a good measure of non-Gaussianity, it is disproportion-
ately sensitive to changes in the distribution tails. Therefore, other measures of indepen-
dence are often used. Another important measure of non-Gaussianity is given by negen-tropy. Negentropy is often a more robust (outlier insensitive) method for estimating the
fourth moment. Negentropy is based on the information-theoretic quantity of (differential) entropy. The more random (i.e. unpredictable and unstructured) the variable is, the larger its entropy. More rigorously, entropy is closely related to the coding length of the random variable; in fact, under some simplifying assumptions, entropy is the coding length of the random variable. The entropy H of a discrete random variable X is defined as
H(X) = -\sum_i P(X = a_i) \log P(X = a_i)    (29)

where the a_i are the possible values of X. This very well known definition can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy. The differential entropy H of a random vector y with density f(y) is defined as
H(y) = -\int f(y) \log f(y) \, dy    (30)

A fundamental result of information theory is that a Gaussian variable has the largest
entropy among all random variables of equal variance. This means that entropy could be
used as a measure of non-Gaussianity. In fact, this shows that the Gaussian distribution
is the “most random” or the least structured of all distributions. Entropy is small for
distributions that are clearly concentrated on certain values, i.e., when the variable is
clearly clustered, or has a PDF that is very “spikey”.
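For instance, Eq. (29) can be evaluated directly for small discrete distributions. A minimal sketch, using base-2 logs so that entropy is measured in bits:

```python
import math

def entropy_bits(probs):
    # Shannon entropy of a discrete distribution (Eq. 29), in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over 4 outcomes is maximally random: 2 bits.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
# A concentrated ("spikey") distribution has much lower entropy.
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))
```

The second distribution is strongly clustered on one value, so its entropy is well below the 2-bit maximum.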
To obtain a measure of non-Gaussianity that is zero for a Gaussian variable and always
nonnegative, one often uses a slightly modified version of the definition of differential
entropy, called negentropy. Negentropy J is defined as follows:

J(y) = H(y_{gauss}) - H(y)    (31)

where y_{gauss} is a Gaussian random variable with the same covariance matrix as y. Due to the above-mentioned properties, negentropy is always non-negative, and it is zero if and only
if y has a Gaussian distribution. Negentropy has the additional interesting property that it is invariant under invertible linear transformations.
The advantage of using negentropy, or, equivalently, differential entropy, as a measure of
non-Gaussianity is that it is well justified by statistical theory. In fact, negentropy is in
some sense the optimal estimator of non-Gaussianity, as far as statistical properties are
concerned. The problem in using negentropy is, however, that it is computationally very
difficult. Estimating negentropy using the definition would require an estimate (possi-
bly nonparametric) of the PDF. Therefore, simpler approximations of negentropy are very
useful. One such approximation actually involves kurtosis:
� ��� � � �� � � � � � � � � �� � �� �I��� � � (32)but this suffers from the problems encountered with kurtosis. Another estimate of negen-
tropy involves entropy:� ��� � � � ����� ��� � � � ����� � � � � � � (33)
where � is a zero mean unit variance Gaussian variable and the functions � � are some non-quadratic functions which lead to the approximation always being non-negative (or zero if� has a Gaussian distribution). � � is usually taken to be ��� ��� � � � ��� � � � � � or � � � ��� � � C�� ��with
��� � � � . If G(y)=y, this degenerates to the definition of kurtosis.12.5.3 Mutual Information based ICA
Using the concept of differential entropy, we define the mutual information (MI) between m (scalar) random variables y_i, i = 1, …, m, as follows:

I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(y)    (34)

MI is an intuitive measure of the (in)dependence between random variables and is equivalent to the Kullback–Leibler divergence between the joint density f(y) and the product of its marginal densities [31]. MI is always non-negative, and zero if and only if the variables are statistically independent. MI therefore takes into account the whole dependence structure of the variables, and not only the covariance, like PCA and related methods.
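A crude histogram-based estimate of MI illustrates this behaviour. This is only a sketch for intuition: practical ICA algorithms do not estimate densities this way, and the bin count and sample size below are arbitrary choices:

```python
import numpy as np

def mi_bits(x, y, bins=16):
    # Estimate I(x; y) from a joint histogram, in bits:
    #   I = sum_{ij} p(i,j) * log2[ p(i,j) / (p(i) p(j)) ]
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])))

rng = np.random.default_rng(1)
a = rng.normal(size=20000)
b = rng.normal(size=20000)
print(mi_bits(a, b))   # independent variables: close to zero
print(mi_bits(a, a))   # fully dependent: large (the entropy of binned a)
```

Note that uncorrelated but dependent variables would also register nonzero MI here, which is exactly the extra structure MI captures beyond covariance.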
Note that for an invertible linear transformation y = Wx, with the y_i constrained to be uncorrelated and of unit variance,

I(y_1, y_2, \ldots, y_m) = C - \sum_i J(y_i)    (35)

where C is a constant that does not depend on W. This shows the fundamental relationship between negentropy and MI.
12.5.4 Fourth order cumulant based ICA algorithm: JADE

The JADE (Joint Approximate Diagonalization of Eigen-matrices) algorithm developed by Cardoso [32] is summarized as follows:

1. Initialization: Estimate a whitening matrix Ŵ using SVD and premultiply the observations by this: z = Ŵx.

2. Form the statistics: Estimate a maximal set {Q_i} of cumulant matrices of the form

Q(M) = E\{ (z^T M z) \, z z^T \} - Z \, \mathrm{tr}(M Z) - Z M Z - Z M^T Z    (37)

where z is an n × 1 random vector, M is any n × n matrix, tr(·) denotes the trace of a matrix and Z = E\{z z^T\} is the covariance matrix.

3. Optimize an orthogonal contrast: Find a rotation matrix V such that the cumulant matrices are as diagonal as possible.

4. Separate: Estimate the mixing matrix A as Â = Ŵ⁻¹V and the sources as ŝ = VᵀŴx.

12.6 Applications of ICA
ICA has been used to perform signal separation in many different domains. These include:

- Blind source separation: watermarking, audio [33], ECG (Bell & Sejnowski [22], Barros et al. [34], McSharry et al. [13]), EEG (Makeig et al. [35, 36]).
- Signal and image denoising (Hyvärinen [37]); medical imaging (fMRI [38]); ECG and EEG (Makeig et al. [35, 36]).
- Modeling of the hippocampus and visual cortex (Lorincz, Hyvärinen).
- Feature extraction and clustering (Marni Bartlett, Girolami, Kolenda).
- Compression and redundancy reduction (Girolami, Kolenda).
Another example of source separation is maternal/fetal ECG separation from an abdominal
recording (which contains both the mother and the fetal ECG signals superimposed). We
will, however, leave an analysis of this to the laboratory assignment, and as a simpler
example, we will study the problem of transient noise on the ECG.
12.6.1 ICA for removing noise on the ECG
Figure 6 provides an excellent illustration of the power of ICA to remove artifacts from
the ECG. Here we see 10 seconds of 3 leads of ECG before and after ICA decomposition
(upper and lower graphs respectively). The upper plot (a) is the same data as in Fig. 1.
Note that ICA has separated out the observed signals into three specific sources: b1) the ECG, b2) high-kurtosis transient (movement) artifacts, and b3) low-kurtosis continuous (observation) noise. In particular, ICA has separated out the in-band QRS-like spikes that occurred at 2.6 and 5.1 seconds. Furthermore, time-coincident artifacts at 1.6 seconds that distorted the QRS complex were extracted, leaving the underlying morphology intact.
Figure 6: 10 seconds of 3-channel ECG: a) before ICA decomposition and b) after ICA decomposition. Plot a) is the same data as in Fig. 1. Note that ICA has separated out the observed signals into three specific sources: b1) the ECG, b2) high-kurtosis transient (movement) artifacts, and b3) low-kurtosis continuous (observation) noise.
Relating this to the cocktail party problem, we have three speakers in three locations. First
and foremost we have the series of cardiac depolarization/repolarization events corre-
sponding to each heartbeat, located in the chest. Each electrode is roughly equidistant
from each of these. Note that the amplitude of the third lead is lower than the other two,
illustrating how the cardiac activity in the heart is not spherically symmetrical.
However, we should not assume that ICA is a panacea to cure all noise. In most situations,
complications due to lead position, a low signal-noise ratio, and positional changes in
the sources cause serious problems. The next section addresses many of the problems in
employing ICA and as usual, using the ECG as a practical illustrative guide.
Figure 7: Data from Fig. 1 after ICA decomposition (Fig. 6) and reconstruction. See text
for explanation.
It should also be noted that the ICA decomposition does not necessarily mean the relevant
clinical characteristics of the ECG have been preserved (since our interpretive knowledge
of the ECG is based upon the observations, not the sources). Therefore, in order to recon-
struct the original ECGs in the absence of noise, we must set to zero the columns of the
demixing matrix that correspond to artifacts or noise, then invert it and multiply by the
decomposed data to 'restore' the original ECG observations. An example of this using the data in Fig. 1 and Fig. 6 is presented in Fig. 7.
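In code, this reconstruction step can be sketched on a toy two-channel example. All signals and matrices here are invented for illustration, and W is obtained by inverting a known mixing matrix purely for the demonstration; in practice an ICA algorithm would estimate it:

```python
import numpy as np

t = np.linspace(0, 1, 500)
s_ecg = np.sin(2 * np.pi * 5 * t)                    # stand-in "ECG" source
s_art = np.where(np.abs(t - 0.5) < 0.02, 5.0, 0.0)   # transient artifact source
S = np.vstack([s_ecg, s_art])

A = np.array([[1.0, 0.4],
              [0.7, 1.0]])        # hypothetical mixing matrix
X = A @ S                         # observed "leads"

W = np.linalg.inv(A)              # demixing matrix (here known exactly)
Y = W @ X                         # decomposed sources

# Zero the contribution of the artifact source, then remix:
A_hat = np.linalg.inv(W)
A_hat[:, 1] = 0.0                 # column feeding the artifact component
X_clean = A_hat @ Y               # observations with the artifact removed

# X_clean now contains only the ECG source's projection onto each lead.
print(np.abs(X_clean - np.outer(A[:, 0], s_ecg)).max())
```

Zeroing a column of W⁻¹ before remixing is the practical equivalent of the inversion step described above, since zeroing columns of W itself would make it singular.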
12.7 Limitations of ICA
12.7.1 Stationary Mixing
ICA assumes a linear stationary mixing model (the mixing matrix is a set of constants, independent of the changing structure of the data over time). However, for many appli-
cations this is only true from certain observation points or for very short lengths of time.
For example, consider the above case of noise on the ECG. As the subject inhales, the chest
expands and the diaphragm lowers. This causes the heart to drop and rotate slightly. If
we consider the mixing matrix A to be composed of a stationary component A_s and a non-stationary component A_{ns}, such that A = A_s + A_{ns}, then A_{ns} is equal to some constant (α) times one of the rotation matrices21, such as

A_{ns} = \alpha \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}

where the angle θ = θ(t) varies sinusoidally at f_{resp}, the frequency of respiration22. In this case, A will be a function of θ, the angle between the different sources (the electrical signals from muscle contractions and those from cardiac activity), which itself is a function of time. This is

21see Eq. 108 in Appendix 12.8.5
22assuming an idealized sinusoidal respiratory cycle
only true for small values of α, and hence a small angle θ, between each source. This is a
major reason for the differences in effectiveness of ICA for source separation for different
lead configurations.
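The respiration-induced component above can be written down directly. A small sketch follows; the scaling α, angle amplitude and respiratory rate are arbitrary illustrative values:

```python
import numpy as np

def nonstationary_mixing(t, alpha=0.05, f_resp=0.2, theta_max=0.1):
    # A_ns = alpha * R(theta(t)): a rotation about one axis whose angle
    # varies sinusoidally at the respiratory frequency f_resp (Hz).
    theta = theta_max * np.sin(2 * np.pi * f_resp * t)
    c, s = np.cos(theta), np.sin(theta)
    return alpha * np.array([[1.0, 0.0, 0.0],
                             [0.0,   c,  -s],
                             [0.0,   s,   c]])

A_ns = nonstationary_mixing(0.0)
# At t = 0 the rotation angle is zero, so A_ns is alpha times the identity;
# at any t, A_ns / alpha is orthogonal (a pure rotation).
print(np.allclose(A_ns, 0.05 * np.eye(3)))
```

Because the mixing A_s + A_ns drifts over the respiratory cycle, any W estimated on a long window is at best a compromise, which is the stationarity limitation described above.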
12.8 Summary and further reading
The basic idea in this chapter has been to explain how we can apply a transformation
to a set of observations in order to project them onto a set of basis functions that are
in some way more informative than the observation space. This is achieved by defining
some contrast function between the data in the projected subspace which is essentially
a measure of independence. If this contrast function is second order (variance) then we
perform decorrelation through PCA. If the contrast function is fourth order and therefore related to Gaussianity, then we achieve ICA. The manner in which we iteratively estimate the demixing matrix using a Gaussian-related cost function encodes our prior beliefs as to
the demixing matrix using a Gaussian-related cost function encodes our prior beliefs as to
the Gaussianity (kurtosis) of the source distributions. The data projected onto the source
components is as statistically independent as possible. We may then select which projec-
tions we are interested in and, after discarding the uninteresting components, invert the
transformation to effect a filtering of the data.
ICA covers an extremely broad class of algorithms, as we have already seen. Lee et al. [39]
show that different theories recently proposed for Independent Component Analysis (ICA)
lead to the same iterative learning algorithm for blind separation of mixed independent
sources. This is because all the algorithms attempt to perform a separation onto a set of
basis functions that are in some way independent, and that independence can always be recast as a departure from Gaussianity.
However, the concept of blind source separation is far broader than this chapter reveals. ICA has been the fertile meeting ground of statistical modeling [40], PCA [41], neural networks [42], Independent Factor Analysis, Wiener filtering [11], wavelets [43, 44, 45],
hidden Markov modeling [46, 7, 47], Kalman filtering [48] and nonlinear dynamics [14,
49]. Many of the problems we have presented in this chapter have been addressed by
extending the ICA model with these tools. Although these concepts are outside the scope
of this course, they are currently the focus of interesting research. For further reading on
ICA and related research, the reader is encouraged to browse the web pages of the authors cited in the references below.
References
[1] Jolliffe IT. Principal Component Analysis. New York: Springer-Verlag, 1988.
[2] Golub GH, Van Loan CF. Matrix Computations. Oxford: North Oxford Academic, 1983.
[3] Moody GB, Mark RG. QRS morphology representation and noise estimation using
the Karhunen-Loève transform. Computers in Cardiology 1989;269–272.
[4] Clifford GD, Tarassenko L. One-pass training of optimal architecture auto-associative
neural network for detecting ectopic beats. IEE Electronic Letters Aug 2001;
37(18):1126–1127.
[5] Trotter HF. An elementary proof of the central limit theorem. Arch Math 1959;
10:226–234.
[6] Penny W, Roberts S, Everson R. ICA: model order selection and dynamic source models, 2001.
[7] Choudrey RA, Roberts SJ. Bayesian ICA with hidden Markov model sources. In International Conference on Independent Component Analysis. Nara, Japan, 2003; 809–814.
[8] Joho M, Mathis H, Lambert R. Overdetermined blind source separation: Using more sensors than source signals in a noisy mixture, 2000.
[9] Lee T, Lewicki M, Girolami M, Sejnowski T. Blind source separation of more sources than mixtures using overcomplete representations, 1998.
[10] Lewicki MS, Sejnowski TJ. Learning overcomplete representations. Neural Computation 2000;12(2):337–365.

[11] Single SW. Non negative sparse representation for Wiener based source.
[12] Clifford GD, McSharry PE. A realistic coupled nonlinear artificial ECG, BP, and res-
piratory signal generator for assessing noise performance of biomedical signal pro-
cessing algorithms. Proc of SPIE International Symposium on Fluctuations and Noise
2004;5467(34):290–301.
[13] McSharry PE, Clifford GD. A comparison of nonlinear noise reduction and indepen-
dent component analysis using a realistic dynamical model of the electrocardiogram.
Proc of SPIE International Symposium on Fluctuations and Noise 2004;5467(09):78–
88.
[14] James CJ, Lowe D. Extracting multisource brain activity from a single electromag-
netic channel. Artificial Intelligence in Medicine May 2003;28(1):89–104.
[15] Broomhead DS, King GP. Extracting qualitative dynamics from experimental data.
Physica D 1986;20:217–236.
[16] MacKay DJC. Maximum likelihood and covariant algorithms for independent component analysis, 1996.
[17] Pearlmutter BA, Parra LC. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Mozer MC, Jordan MI, Petsche T (eds.), Advances in Neural Information Processing Systems, volume 9. The MIT Press, 1997; 613.
[18] Cardoso J. Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters April 1997;4(4):112–114.
[19] Girolami M, Fyfe C. Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms. In NIPS-96 Blind Signal Separation Workshop, volume 8. 1996.
[20] Comon P. Independent component analysis, a new concept? Signal Processing 1994;
36:287–314.
[21] Hyvärinen A, Oja E. A fast fixed point algorithm for independent component analysis.
Neural Computation 1997;9:1483–1492.
[22] Bell AJ, Sejnowski TJ. An information-maximization approach to blind separation and blind deconvolution. Neural Computation 1995;7(6):1129–1159.

[23] Amari S, Cichocki A, Yang HH. A new learning algorithm for blind signal separation. In Touretzky DS, Mozer MC, Hasselmo ME (eds.), Advances in Neural Information Processing Systems, volume 8. The MIT Press, 1996; 757–763.
[24] Lee TW, Girolami M, Sejnowski TJ. Independent component analysis using an ex-
tended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neu-
ral Computation 1999;11(2):417–441.
[25] Karhunen J, Joutsensalo J. Representation and separation of signals using nonlinear
PCA type learning. Neural Networks 1994;7:113–127.
[26] Karhunen J, Wang L, Vigario R. Nonlinear PCA type approaches for source separation and independent component analysis, 1995.
[27] Oja E. The nonlinear PCA learning rule and signal separation – mathematical analysis, 1995.
[28] Bourlard H, Kamp Y. Auto-association by multilayer perceptrons and singular value
decomposition. Biol Cybern 1988;(59):291–294.
[29] Bishop C. Neural Networks for Pattern Recognition. New York: Oxford University
Press, 1995.
[30] Cardoso J, Laheld B. Equivariant adaptive source separation, 1996.

[31] Cardoso JF. Entropic contrasts for source separation. In Haykin S (ed.), Adaptive Unsupervised Learning. 1996.
[32] Cardoso JF. Multidimensional independent component analysis. Proc ICASSP 1998;
Seattle.
[33] Toch B, Lowe D, Saad D. Watermarking of audio signals using ICA. In Third Interna-
tional Conference on Web Delivering of Music, volume 8. 2003; 71–74.
[34] Barros A, Mansour A, Ohnishi N. Adaptive blind elimination of artifacts in ECG
signals. In Proceedings of I and ANN. 1998; .
[35] Makeig S, Bell AJ, Jung TP, Sejnowski TJ. Independent component analysis of electroencephalographic data. In Touretzky DS, Mozer MC, Hasselmo ME (eds.), Advances in Neural Information Processing Systems, volume 8. The MIT Press, 1996; 145–151.
[36] Jung TP, Humphries C, Lee TW, Makeig S, McKeown MJ, Iragui V, Sejnowski TJ. Extended ICA removes artifacts from electroencephalographic recordings. In Jordan MI, Kearns MJ, Solla SA (eds.), Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
[37] Hyvärinen A. Sparse code shrinkage: Denoising of nongaussian data by maximum
likelihood estimation. Neural Computation 1999;11(7):1739–1768.
[38] Hansen LK. ICA of fMRI based on a convolutive mixture model. In Ninth Annual Meeting of the Organization for Human Brain Mapping (HBM 2003), New York, June 2003.
[39] Lee TW, Girolami M, Bell AJ, Sejnowski TJ. A unifying information-theoretic framework for independent component analysis, 1998.
[40] Lee TW, Lewicki MS, Sejnowski TJ. ICA mixture models for unsupervised classification of non-Gaussian classes and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(10):1078–1089.
[41] Karhunen J, Pajunen P, Oja E. The nonlinear PCA criterion in blind source separation: Relations with other approaches. Neurocomputing 1998;22:520.
[42] Amari S, Cichocki A, Yang H. Recurrent neural networks for blind separation of sources, 1995.
[43] Roberts S, Roussos E, Choudrey R. Hierarchy, priors and wavelets: structure and signal modelling using ICA. Signal Process 2004;84(2):283–297. ISSN 0165-1684.
[44] Portilla J, Strela V, Wainwright M, Simoncelli E. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain; 37–40.
[45] Simoncelli EP. Bayesian denoising of visual images in the wavelet domain. In Müller P, Vidakovic B (eds.), Bayesian Inference in Wavelet Based Models. New York: Springer-Verlag, Spring 1999; 291–308.
[46] Penny W, Everson R, Roberts S. Hidden Markov independent components analysis, 2000.
[47] Stephen WP. Hidden Markov independent components for biosignal analysis.
[48] Everson R, Roberts S. Particle filters for nonstationary ICA, 2001.
[49] Valpola H, Karhunen J. An unsupervised ensemble learning method for nonlinear
dynamic state-space models. Neural Comput 2002;14(11):2647–2692. ISSN 0899-
7667.
[50] Clifford GD, Tarassenko L, Townsend N. Fusing conventional ECG QRS detection al-
gorithms with an auto-associative neural network for the detection of ectopic beats.
In International Conference on Signal Processing. IFIP, Beijing, China: World
Computer Congress, August 2000; 1623–1628.
[51] Tarassenko L, Clifford GD, Townsend N. Detection of ectopic beats in the electro-
cardiogram using an auto-associative neural network. Neural Processing Letters Aug
2001;14(1):15–25.
[52] Golub GH. Least squares, singular values and matrix approximations. Applikace
Matematiky 1968;(13):44–51.
[53] Bunch J, Nielsen C. Updating the singular value decomposition. Numer Math 1978;
(31):111–129.
Appendix A:
12.8.1 Karhunen-Loève or Hotelling Transformation
The Karhunen-Loève transformation maps vectors x_n in a d-dimensional space (x_1, \ldots, x_d) onto vectors z_n in an M-dimensional space (z_1, \ldots, z_M), where M < d.

The vector x_n can be represented as a linear combination of a set of d orthonormal vectors u_i:

x = \sum_{i=1}^{d} z_i u_i    (38)
where the vectors u_i satisfy the orthonormality relation

u_i^T u_j = \delta_{ij}    (39)

in which \delta_{ij} is the Kronecker delta symbol.
This transformation can be regarded as a simple rotation of the coordinate system from the original x's to a new set of coordinates represented by the z's. The z_i are given by

z_i = u_i^T x    (40)
Suppose that only a subset of M < d basis vectors u_i are retained, so that we use only M coefficients z_i. The remaining coefficients will be replaced by constants b_i so that each vector x is approximated by the expression

\tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i    (41)

The residual error in the vector x_n introduced by the dimensionality reduction is given by

x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_i^n - b_i) u_i    (42)

We can then define the best approximation to be that which minimises the sum of the squares of the errors over the whole data set. Thus we minimise

E_M = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_i^n - b_i)^2    (43)

If we set the derivative of E_M with respect to b_i to zero we find

b_i = \frac{1}{N} \sum_{n=1}^{N} z_i^n = u_i^T \bar{x}    (44)
-
Where we have defined the vector �� to be the mean vector of the � vectors,��?� ��
0� � � � � (45)We can now write the sum-of-squares-error as
� � � �� 0� � � ��
� � ��� ��� � � ��� � � �� ��� �� �� 0� � � � � ��� ��� (46)
Where � is the covariance matrix of the set of vectors � � and is given by� � � � � � � � �� �I��� � � �� � � (47)
It can be shown (see Bishop [29]) that the minimum occurs when the basis vectors satisfy
� � � � �� ��� (48)so that they are eigen vectors of the covariance matrix. Substituting (48) into (46) and
making use of the orthonormality relation of (39), the error criterion at the minimum is
given by
� � � �� ��� � ��� � (49)
Thus the minimum error is obtained by choosing the � � smallest eigenvalues, and theircorresponding eigenvectors, as the ones to discard.
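The derivation above can be checked numerically. The following sketch (an illustration, not code from the original text; the synthetic data are arbitrary) computes the Karhunen-Loève transform via the eigendecomposition of the covariance matrix, equations (47)-(48), and verifies that the residual error equals half the sum of the discarded eigenvalues, as in (49):

```python
import numpy as np

def kl_transform(X, M):
    """Karhunen-Loeve transform: project the N x d data matrix X onto the
    M eigenvectors of the covariance matrix with the largest eigenvalues."""
    x_bar = X.mean(axis=0)                    # mean vector, equation (45)
    Sigma = (X - x_bar).T @ (X - x_bar)       # covariance matrix, equation (47)
    evals, evecs = np.linalg.eigh(Sigma)      # eigenvectors satisfy (48)
    order = np.argsort(evals)[::-1]           # sort eigenvalues, descending
    U = evecs[:, order[:M]]                   # retain the M largest
    Z = (X - x_bar) @ U                       # coefficients z_i = u_i^T (x - x_bar)
    X_approx = Z @ U.T + x_bar                # reconstruction, equation (41)
    return Z, X_approx, evals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])
Z, X_approx, evals = kl_transform(X, M=1)

# Residual error (43) equals half the sum of the discarded eigenvalues (49)
err = 0.5 * np.sum((X - X_approx) ** 2)
print(np.isclose(err, 0.5 * evals[1:].sum()))
```

Note that `np.linalg.eigh` is used because the covariance matrix is symmetric; equivalently, the retained eigenvectors could be obtained from the SVD of the mean-centred data matrix.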
Appendix B:
12.8.2 Gram-Schmidt Orthogonalization Theorem

If $x_1, \ldots, x_m$ is a linearly independent vector system in a vector space with scalar product $(\cdot,\cdot)$, then there exists an orthonormal system $e_1, \ldots, e_m$ such that

$$\mathrm{span}(x_1, \ldots, x_k) = \mathrm{span}(e_1, \ldots, e_k), \qquad k = 1, \ldots, m \qquad (50)$$

This assertion can be proved by induction. In the case $k = 1$, we define $e_1 = x_1 / \|x_1\|$ and thus $\mathrm{span}(x_1) = \mathrm{span}(e_1)$. Now assume that the proposition holds for $k - 1$, i.e., there exists an orthonormal system $e_1, \ldots, e_{k-1}$ such that $\mathrm{span}(x_1, \ldots, x_{k-1}) = \mathrm{span}(e_1, \ldots, e_{k-1})$. Then consider the vector

$$v_k = \lambda_1 e_1 + \cdots + \lambda_{k-1} e_{k-1} + x_k \qquad (51)$$

choosing the coefficients $\lambda_1, \ldots, \lambda_{k-1}$ so that $v_k \perp e_i$ for $i = 1, \ldots, k-1$, i.e. $(v_k, e_i) = 0$. This leads to the $k - 1$ conditions

$$\lambda_i (e_i, e_i) + (x_k, e_i) = 0, \quad \text{i.e.} \quad \lambda_i = -(x_k, e_i), \qquad i = 1, \ldots, k-1 \qquad (52)$$

Therefore,

$$v_k = x_k - (x_k, e_1)\, e_1 - \cdots - (x_k, e_{k-1})\, e_{k-1} \qquad (53)$$

Now we choose $e_k = v_k / \|v_k\|$. Since $x_k \notin \mathrm{span}(x_1, \ldots, x_{k-1}) = \mathrm{span}(e_1, \ldots, e_{k-1})$, we have $v_k \neq 0$, and by the construction of the vectors $v_k$ and $e_k$, $e_k \perp e_i$ for $i = 1, \ldots, k-1$. Hence $e_1, \ldots, e_k$ is an orthonormal system with

$$\mathrm{span}(e_1, \ldots, e_k) \subseteq \mathrm{span}(x_1, \ldots, x_k) \qquad (54)$$

From the representation of the vector $v_k$ we can see that $x_k$ is a linear combination of the vectors $e_1, \ldots, e_k$. Thus

$$\mathrm{span}(x_1, \ldots, x_k) \subseteq \mathrm{span}(e_1, \ldots, e_k) \qquad (55)$$

and finally,

$$\mathrm{span}(x_1, \ldots, x_k) = \mathrm{span}(e_1, \ldots, e_k) \qquad (56)$$

An example

Given a linearly independent vector system $(x_1, x_2, x_3)$, we want to find an orthonormal system $(e_1, e_2, e_3)$ for which $\mathrm{span}(x_1, x_2, x_3) = \mathrm{span}(e_1, e_2, e_3)$. To apply the orthogonalization process, we first check the system $(x_1, x_2, x_3)$ for linear independence. Next we find

$$e_1 = x_1 / \|x_1\| \qquad (57)$$

For $v_2$ we get

$$v_2 = x_2 - (x_2, e_1)\, e_1 \qquad (58)$$

Since $v_2 \neq 0$, we set $e_2 = v_2 / \|v_2\|$. The vector $v_3$ can be expressed in the form

$$v_3 = x_3 - (x_3, e_1)\, e_1 - (x_3, e_2)\, e_2 \qquad (59)$$

Therefore,

$$e_3 = v_3 / \|v_3\| \qquad (60)$$
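As a concrete illustration of equations (51)-(53) and (57)-(60), the following sketch (the three input vectors are arbitrary choices, not taken from the original example) orthonormalises a linearly independent system in three dimensions:

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalise the rows of X (assumed linearly independent):
    subtract the projections onto the previously constructed e_i,
    as in (53), then normalise."""
    E = []
    for x in X:
        v = x - sum(np.dot(x, e) * e for e in E)  # v_k = x_k - sum_i (x_k, e_i) e_i
        E.append(v / np.linalg.norm(v))           # e_k = v_k / ||v_k||
    return np.array(E)

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
E = gram_schmidt(X)
print(np.allclose(E @ E.T, np.eye(3)))  # orthonormality relation (39)
```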
Appendix C:
12.8.3 Gradient descent, neural network training and PCA
This appendix is intended to give the reader a more thorough understanding of the inner
workings of gradient descent and learning algorithms. In particular, we will see how
the weights of a neural network are essentially a matrix that we train on some data by
minimising or maximising a cost function through gradient descent. If a multi-layered
perceptron (MLP) neural network is auto-associative (it has as many output nodes as input
nodes), then we essentially have the same paradigm as Blind Source Separation. The only
difference is the cost function.
This appendix describes relevant aspects of gradient descent and neural network theory.
The error back-propagation algorithm is derived from first principles in order to lay the
ground-work for gradient descent training an auto-associative neural network.
The neuron
The basic unit of a neural network is the neuron. It can have many inputs $x_i$ and its output value, $y$, is a function, $f$, of all these inputs. Figure 8 shows the basic architecture of a neuron with three inputs.
Figure 8: The basic neuron
For a linear unit, the function $f$ is the linear weighted sum of the inputs, sometimes known as the activation $a$, in which case the output is given by

$$y = a = \sum_i w_i x_i \qquad (61)$$

For a nonlinear unit, a nonlinear function $f$ is applied to the linearly weighted sum of inputs. This function is often the sigmoid function, defined as

$$f(a) = \frac{1}{1 + e^{-a}} \qquad (62)$$

The output of a non-linear neuron is then given by

$$y = f\left( \sum_i w_i x_i \right) \qquad (63)$$
If the outputs of one layer of neurons are connected to the inputs of another layer, a neural
network is formed.
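The linear and sigmoid neurons of equations (61)-(63) can be sketched in a few lines; the input and weight values below are arbitrary illustrations:

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation, equation (62): f(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, nonlinear=True):
    """A single neuron: the weighted sum of inputs (61), optionally
    passed through the sigmoid nonlinearity (63)."""
    a = np.dot(w, x)  # activation
    return sigmoid(a) if nonlinear else a

x = np.array([0.5, -1.0, 2.0])  # three inputs, as in Figure 8
w = np.array([0.1,  0.4, 0.3])
print(neuron(x, w, nonlinear=False))  # linear output, equation (61)
print(neuron(x, w))                   # sigmoid output, equation (63)
```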
Multi-layer networks
The standard MLP consists of three layers of nodes, the layers being interconnected via
synaptic weights � ��
and �� � as shown in Figure 9. The input units simply pass all of the
input data, likewise the non-linear output units of the final layer receive data from each of
the units in the hidden layer. Bias units (with a constant value of� � �
), connect directly via
bias weights to each of the neurons in the hidden and output layers (although these have
been omitted from the diagram for clarity).
Figure 9: Schematic of a 5-3-2 multi-layer perceptron. Bias units and weights are omitted
for clarity.
Learning algorithm - Gradient descent
The input data used to train the network, now defined as $x_i$ for consistency of notation, is fed into the network and propagated through to give an output $y_k$ given by

$$y_k = f\left( \sum_j w_{kj}\, f\left( \sum_i w_{ji} x_i \right) \right) \qquad (64)$$

Note that our $X$'s are now $x_i$'s and our sources are the $y_k$'s (or the $a_j$'s if a multi-layered network is used). During training, the target data or desired output, $t_k$, which is associated with
the training data, is compared to the actual output $y_k$. The weights, $w_{ji}$ and $w_{kj}$, are then adjusted in order to minimise the difference between the propagated output and the target value. This error is defined over all training patterns, $n$, in the training set as

$$E = \frac{1}{2} \sum_{n} \sum_{k} \left[ f\left( \sum_j w_{kj}\, f\left( \sum_i w_{ji} x_i^n \right) \right) - t_k^n \right]^2 \qquad (65)$$
Note that in the case of ML-BSS the target is one of the other output vectors (source
estimates) and the error function, $E$, is the log likelihood. We must therefore sum $E$ over
all the possible pairs of output vectors.
Error back-propagation
The squared error, $E$, can be minimised using the method of gradient descent [29]. This requires the gradient to be calculated with respect to each weight, $w_{ji}$ and $w_{kj}$. The weight update equations for the hidden and output layers are given as follows:

$$w_{kj}^{\mathrm{new}} = w_{kj}^{\mathrm{old}} - \eta \frac{\partial E}{\partial w_{kj}} \qquad (66)$$
$$w_{ji}^{\mathrm{new}} = w_{ji}^{\mathrm{old}} - \eta \frac{\partial E}{\partial w_{ji}} \qquad (67)$$

The full derivation of the calculation of the partial derivatives, $\partial E / \partial w_{kj}$ and $\partial E / \partial w_{ji}$, is given in Appendix 12.8.4. Using equations 106 and 98 we can write:

$$w_{kj}^{\mathrm{new}} = w_{kj}^{\mathrm{old}} + \eta\, \delta_k\, a_j \qquad (68)$$
$$w_{ji}^{\mathrm{new}} = w_{ji}^{\mathrm{old}} + \eta\, \delta_j\, x_i \qquad (69)$$

where $\eta$ is the learning rate and $\delta_k$ and $\delta_j$ are given below:

$$\delta_k = (t_k - y_k)\, f'(a_k) \qquad (70)$$
$$\delta_j = f'(a_j) \sum_k w_{kj}\, \delta_k \qquad (71)$$

For the bias weights, the $a_j$ and $x_i$ in the above weight update equations are replaced by
unity.
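The update equations (68)-(71) can be sketched for a single training pattern as follows. This is an illustration, not the chapter's own code: the network size (2-3-1), learning rate, and data are arbitrary choices, and bias weights are omitted for brevity. For the sigmoid (62), the derivative is $f'(a) = f(a)(1 - f(a))$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, t, W1, W2, eta=0.1):
    """One gradient-descent update, equations (68)-(71), for a single
    training pattern with sigmoid hidden and output units."""
    a_j = W1 @ x;   z_j = sigmoid(a_j)             # hidden-unit outputs
    a_k = W2 @ z_j; y_k = sigmoid(a_k)             # network output, equation (64)
    delta_k = (t - y_k) * y_k * (1 - y_k)          # output deltas (70), f'(a) = f(1-f)
    delta_j = z_j * (1 - z_j) * (W2.T @ delta_k)   # hidden deltas (71)
    W2 += eta * np.outer(delta_k, z_j)             # output-layer update (68)
    W1 += eta * np.outer(delta_j, x)               # hidden-layer update (69)
    return 0.5 * np.sum((y_k - t) ** 2)            # squared error (65), one pattern

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(3, 2))            # 2 inputs -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(1, 3))            # 3 hidden units -> 1 output
x, t = np.array([0.2, 0.9]), np.array([0.7])

errors = [train_step(x, t, W1, W2) for _ in range(200)]
print(errors[-1] < errors[0])  # the error decreases with training
```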
Training is an iterative process (repeated application of equations 68 and 69) but, if con-
tinued for too long, the network starts to fit the noise in the training set and that will have
a negative effect on the performance of the trained network on test data. The decision on
when to stop training is of vital importance, but is often defined as the point when the error function
(or its gradient) drops below some predefined level. The use of an independent validation
set is often the best way to decide when to terminate training (see Bishop [29] p. 262
for more details). However, in the case of an auto-associative network, no validation set
is required and training can be terminated when the ratio of the variance of the input
and output data reaches a plateau (see [50, 51]).
Auto-associative networks
An auto-associative network has as many output nodes as input nodes and can be trained
using an objective cost function measured between the inputs and outputs; the target data
is simply the input data. Therefore, no labelling of the