What is Independent Component Analysis?
Alan Julian Izenman
Temple University
E-mail address: [email protected]
For David R. Brillinger
July 2003
Abstract
This article describes a relatively new research topic called independent component analysis (ICA), which is becoming very popular in the signal processing literature and amongst those working in machine learning and data mining. The primary focus of ICA is to resolve the classical problem of blind source separation (BSS), in which an unknown mixture of non-Gaussian signals is decomposed into its independent component signals. The classical example of BSS is the so-called cocktail-party problem, where the mixture consists of simultaneous speech signals recorded by a number of microphones. Important applications include biomedical signal processing (usually brain-wave activity in the form of EEG and MEG tracings), audio signal separation (mixed speech and music signals), telecommunications (a confusion of signals transmitted by multiple users of mobile phones), financial time series (portfolios of stocks), and data mining (text document analysis). The ICA methodology has much in common with that of projection pursuit (PP).
KEY WORDS: Blind source separation; Brain imaging; Cumulants; FastICA algorithm; Financial time series; Independent factor analysis; Kernel ICA; Kurtosis; Maximum likelihood; Mutual information; Negentropy; Projection pursuit; Signal processing; Supergaussian and subgaussian components; Time series analysis.
1. INTRODUCTION
Independent component analysis (ICA) is a multivariate statistical technique which seeks
to uncover hidden variables in high-dimensional data. As such, it belongs to the class of latent
variable models, such as factor analysis (FA). Furthermore, because of its success in analyzing
signal processing data, ICA can also be regarded as a digital signal transform method.
Although the concept of ICA was introduced in 1982 in a neurophysiological context, its
name was coined by Herault and Jutten (1986). See Jutten (2000) for the early history.
Since then, theoretical insights, computational algorithms, and new applications have been
developed to enhance and understand the ICA technique. Several books (e.g., Cichocki and
Amari, 2003; Hyvarinen, Karhunen, and Oja, 2001; Lee, 1998) and edited volumes (e.g.,
Roberts and Everson, 2001; Girolami, 2000; Nandi, 1999) have appeared and a huge number
of articles have been published on the topic. There is also an international workshop on ICA
and related topics held annually in different countries. As an indication of the popularity
today of ICA, a Google search on “independent component analysis” resulted in almost 1.7
million hits. Yet, we see very little attention paid to ICA in the statistical literature; an
isolated reference is Hastie, Tibshirani, and Friedman (2001, sec. 14.6).
In its most basic form, the ICA model is assumed to be a linear mixture of a number of
unknown hidden source variables, where the mixing coefficients are also unknown. A totally
“blind” approach to determining both the hidden variables and the mixing coefficients solely
from the observed multivariate data fails because the problem as stated is not well-defined.
To build more structure into the problem, we require the hidden variables to be mutually in-
dependent and also (with at most one exception) non-Gaussian. ICA is actually an amalgam
of several related approaches to this problem, and these approaches are characterized by the
types of assumptions visited upon the distributions of the independent source variables and
whether or not a separate noise component should be included in the ICA model.
The signal processing problem of blind source separation (BSS), in which an unknown
mixture of non-Gaussian signals is to be decomposed into its independent component sig-
nals, is closely related to ICA. BSS is similar to the classical electrical engineering problem of
source separation, but in BSS there is no knowledge of the signals that make up the mixture.
The best-known example of BSS is the so-called “cocktail-party problem” (Cherry, 1953). In
this problem, m people are speaking simultaneously at a party, and each of r microphones
placed in the same room at different distances from each speaker records a different mix-
ture of the speakers’ voices at n time points. The question is whether, based upon these
microphone recordings, we can separate out the individual speech signals of each of the m
speakers. Despite the fact that the cocktail-party problem assumes the speakers babble on
independently without considering the presence of other partygoers (who usually speak in
clustered groups), it does give a fairly simplistic explanation of how one can envision BSS
problems.
Amongst the cocktail-party-type problems, ICA has been extensively applied to the study
of the human brain, whose functions “provide the basis of perception and cognition and
underlie emotion and creative expression” (Pechura and Martin, 1991, p. 27). Patterns of
human brain-wave activity can be viewed through noninvasive recordings made by r (usually
around 20, sometimes a lot more) electrodes placed evenly around a subject’s head during
different periods of consciousness and sleep. The electrodes capture a mixture of brain
waves from different areas of the brain, and it is the job of ICA to separate them into
individual source signals. In particular, electroencephalographic (EEG) recordings make it
possible to relate certain types of behavior to changes in the electrical activity of the cerebral
cortex; event-related potential (ERP) recordings are finely-tuned EEGs resulting from the
stimulation of specific visual, auditory, or sensory systems; and magnetoencephalographic
(MEG) recordings measure the magnetic fields that are generated by cortical activity. ICA
applied to EEG, ERP, or MEG recordings assumes that the source signals are statistically
independent and stationary, and that the mixing process is linear and instantaneous.
ICA has also been found to be successful in analyzing the extremely large datasets ob-
tained from functional magnetic resonance imaging (fMRI) experiments (McKeown, Makeig,
Brown, Jung, Kindermann, Bell, and Sejnowski, 1998). Other applications of ICA include
extracting structure from financial stock returns (Back and Weigand, 1997), mapping the
cosmic microwave background anisotropy from satellite radiometric sky maps (Salerno, Be-
dini, Kuruoglu, and Tonazzini, 2002), separating out the effects of major volcanic eruptions
from climate and temperature data (Fodor and Kamath, 2003), Web image retrieval and clas-
sification, wireless communications and speech recognition systems, and agricultural remote
sensing images. Classification of microarray gene expression profiles using ICA methods has
also become a popular research issue.
The technical aspects of ICA in its basic formulation are remarkably similar to those of
exploratory projection pursuit (PP), which was developed over a decade earlier than ICA,
first by Kruskal (1969, 1972) and then by Friedman and Tukey (1974) (who named it). Af-
ter a brief hiatus, there followed a flurry of activity in which PP was studied by Tukey and
Tukey (1981), Friedman and Stuetzle (1982), Huber (1985), Friedman (1987), Jones and Sib-
son (1987), Hall (1989), and Cook, Buja, and Cabrera (1993). Because most low-dimensional
projections of high-dimensional data are approximately Gaussian-distributed (Diaconis and
Freedman, 1984), we should not expect such projections to show unusual patterns or struc-
ture. PP was, therefore, designed to seek out “interesting” low-dimensional (typically, one-
or two-dimensional) orthogonal projections of multivariate data, where the least-interesting
feature is defined to be Gaussianity. In its original incarnation, PP was driven by the desire to
expose specific features (e.g., local concentration, clustering into distinct groups, clumpiness,
or clottedness) which indicated non-Gaussianity of the data. Because an exhaustive search
for such features was clearly impossible, the search was automated. Indexes of interesting-
ness were created and optimized numerically in an attempt to imitate how users intuitively
(by eye) chose interesting projections (see Friedman’s discussion of Huber, 1985). This for-
mulation was later replaced by a search for projections that are as far from Gaussianity as
possible.
ICA and PP methodologies look at the same data in very different ways, yet they both use
the same (or similar) computational tool (numerically optimizing an objective function) to
achieve a common statistical goal of finding low-dimensional, non-Gaussian projections of the
data. Differences between ICA and PP derive from the different problems they were originally
built to solve. For example, ICA was introduced to resolve a separation problem, starting
with the estimation of independent components, while PP was designed as an exploratory
tool for visualization, focussing on dimensionality reduction of a high-dimensional space.
While much of the PP methodology has been incorporated into the ICA toolkit, there has
been little cross-pollination in the other direction. Recent enhancements of the ICA model
which take into account time structure and nonlinearity of the mixing coefficients further
distinguish ICA from PP.
This paper provides an expository account of ICA. In Section 2, the important step of
preprocessing the data using centering and sphering operations is described. Then, in Sec-
tion 3, we discuss the general type of problem for which ICA has been applied. In Section
4, ways of measuring non-Gaussianity are discussed, including skewness- and kurtosis-based
measures and relative entropy, which is a normalized version of entropy. Because entropy
is difficult to estimate directly, and because its major component is an unknown probabil-
ity density function, we, first, need to estimate the underlying source density. Two types
of density estimates are considered, an orthogonal polynomial approximation using a trun-
cated Gram-Charlier expansion, which leads to a moment-based index, and a nonpolynomial
approximation which is used in a FastICA algorithm.
In Section 5, we deal with the linear-mixing, noiseless, ICA model. Specifically, we de-
scribe a FastICA algorithm for extracting a single source component and two extensions
of that algorithm for extracting multiple independent source components. We also show
two methods of computing maximum-likelihood estimates of the independent source com-
ponents, one using the EM algorithm and another using the FastICA algorithm. In Section
6, we discuss the linear-mixing, noisy, ICA model. A special case of this model is the well-
known factor-analysis model, and we describe the principal components approach and the
maximum-likelihood approach using the EM algorithm. We also discuss the independent
factor analysis model, which is a hybrid between factor analysis and ICA. In Section 7, we
show the close relationship between ICA and projection pursuit.
2. CENTERING AND SPHERING
Suppose we observe a random r-vector, $\mathbf{X} = (X_1, \ldots, X_r)^{\tau}$, of correlated measurements
with mean r-vector $E(\mathbf{X}) = \mu$ and (r×r) covariance matrix $\mathrm{cov}(\mathbf{X}) = \Sigma_{XX}$. Prior to carrying
out PP or ICA applications, we preprocess X so that its r components have commensurate
scales (see, e.g., Tukey and Tukey, 1981).
We do this by first centering X so that its components have zero mean, and then by
sphering (or whitening) the result so that its components are uncorrelated with unit variances.
Sphering is a linear transformation which removes all traces of scale and correlation structure
from X. The spectral decomposition of the covariance matrix is $\Sigma_{XX} = \mathbf{U}\Lambda\mathbf{U}^{\tau}$, where
the columns of the orthogonal matrix $\mathbf{U}$ are the eigenvectors of $\Sigma_{XX}$, and $\Lambda$ is a diagonal
matrix with diagonal elements the eigenvalues of $\Sigma_{XX}$. The columns of $\mathbf{U}$ and the diagonal
elements of $\Lambda$ are ordered by the decreasing magnitudes of the eigenvalues of $\Sigma_{XX}$. The
(centered and) sphered version of X is given by
$$ \mathbf{X} \leftarrow \Sigma_{XX}^{-1/2}(\mathbf{X} - \mu), \tag{1} $$
where $\Sigma_{XX}^{-1/2} = \mathbf{U}\Lambda^{-1/2}\mathbf{U}^{\tau}$. This transformation is equivalent to computing the principal
components of $\mathbf{X} - \mu$ and then rescaling the principal components to have unit variance.
In other words, up to the orthogonal rotation $\mathbf{U}$, we can write (1) as $\mathbf{X} \leftarrow \Lambda^{-1/2}\mathbf{U}^{\tau}(\mathbf{X} - \mu)$. If $\Sigma_{XX}$ has less than full rank,
only those principal components having nonzero variance would be retained (and rescaled).
A benefit of sphering X is that it is now affine invariant, with $\mu = \mathbf{0}$ and $\Sigma_{XX} = \mathbf{I}_r$.
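In practice, $\mu$ and $\Sigma_{XX}$ will be unknown. Thus, we use n independent observations,
$\mathbf{X}_1, \ldots, \mathbf{X}_n$, on $\mathbf{X}$ to compute $\bar{\mathbf{X}} = n^{-1}\sum_{i=1}^{n} \mathbf{X}_i$ and $\hat{\Sigma}_{XX} = n^{-1}\sum_{i=1}^{n} (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})^{\tau}$,
respectively. Centering and sphering the data using $\mathbf{X}_i \leftarrow \hat{\Sigma}_{XX}^{-1/2}(\mathbf{X}_i - \bar{\mathbf{X}})$, i = 1, 2, . . . , n,
transforms an elliptically-shaped symmetric cloud of points into a spherically-shaped cloud.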
To reduce the dimensionality of the data, it is commonly advocated that only the first J < r
sphered variables be retained, where J is chosen to explain a certain (high) proportion of
the total variance (see, e.g., Friedman, 1987). If outliers are present, robust versions of the
sphering process are discussed in Tukey and Tukey (1981).
We note that the practice of sphering is somewhat controversial. Although sphering has
computational and interpretational advantages (see, e.g., Friedman, 1987), arguments have
been made that the act of sphering is too closely tied to underlying unimodal (and especially
Gaussian) distributions, an environment we wish to avoid (see, e.g., the comments of Gower,
and Hastie and Tibshirani in the discussion of Jones and Sibson, 1987). However, we follow
PP and ICA practice by assuming that the components of X have been preprocessed to be
mutually uncorrelated, each having zero mean and unit variance.
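The centering and sphering steps of this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `center_and_sphere`, the simulated data, and the choice of the symmetric inverse square root of the sample covariance (with divisor n) are ours, and the sample covariance is assumed to have full rank.

```python
import numpy as np

def center_and_sphere(X):
    """Center and sphere an (n, r) data matrix whose rows are observations.

    Hypothetical helper: returns the sphered data and the sphering matrix.
    Assumes the sample covariance matrix has full rank.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # centering: subtract the sample mean
    cov = Xc.T @ Xc / X.shape[0]             # sample covariance with divisor n
    eigvals, U = np.linalg.eigh(cov)         # spectral decomposition cov = U diag(eigvals) U^T
    inv_sqrt = U @ np.diag(eigvals ** -0.5) @ U.T   # symmetric inverse square root
    # inv_sqrt is symmetric, so right-multiplying the rows applies it to each observation
    return Xc @ inv_sqrt, inv_sqrt

# Illustrative data: correlated, non-centered mixtures of uniform variables
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5], [0.3, 1.0]])
X = rng.uniform(-1, 1, size=(500, 2)) @ mixing.T + np.array([3.0, -1.0])
Z, sphering_matrix = center_and_sphere(X)
```

After this step the columns of `Z` have zero sample mean and identity sample covariance, which is the preprocessed form assumed in the rest of the paper.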
3. THE GENERAL ICA PROBLEM
In its most general form, the ICA model assumes that X is generated by
$$ \mathbf{X} = f(\mathbf{S}) + \mathbf{e}, \tag{2} $$
where $\mathbf{S} = (S_1, \ldots, S_m)^{\tau}$ is an (unobservable) random m-vector variate of sources whose
components $\{S_j\}$ are independent latent variables each having zero mean, $f : \mathbb{R}^m \to \mathbb{R}^r$
is an unknown mixing function, and $\mathbf{e}$ is an additive r-vector-valued noise component with
zero mean. Independence of the sources means that each individual source signal is thought
to be generated by a process unrelated to any other source signal. In general, it suffices to
assume that $E(\mathbf{S}) = \mathbf{0}$ and $\mathrm{cov}(\mathbf{S}) = \mathbf{I}_m$.
The BSS problem is to invert f and estimate S. As it stands, this problem is ill-posed
and needs some additional constraints or regularization on S, f , and e. If we take f to be a
linear function, $f(\mathbf{S}) = \mathbf{A}\mathbf{S}$, where $\mathbf{A}$ is a “mixing” matrix, then (2) is described as a linear
ICA model, while if f is assumed to be nonlinear, then (2) is described as a nonlinear ICA
model. Most applications of ICA assume no additive noise e, and that all noise in the model
is to be associated with the components of the random vector S. Such a model is referred to
as noiseless ICA. If e is included in (2), the model is described as noisy ICA.
It turns out that the noiseless ICA model with linear mixing, X = AS, can only be solved
if the vector S with independent components is not Gaussian. We can see this by assuming
the contrary. Suppose that the sources, $S_1, \ldots, S_m$, are independent and Gaussian, each
with zero mean and unit variance. Their joint density is given by
$q_S(\mathbf{s}) = \prod_{j=1}^{m} q_{S_j}(s_j) = (2\pi)^{-m/2} e^{-\|\mathbf{s}\|^2/2}$, where $\|\mathbf{s}\|^2 = \sum_j s_j^2$. If the mixing matrix $\mathbf{A}$ is square (m = r) and,
hence, orthogonal ($\mathbf{I}_r = \Sigma_{XX} = \mathbf{A}\mathbf{A}^{\tau}$, so that $\mathbf{A}^{-1} = \mathbf{A}^{\tau}$), then one can show that the
density of $\mathbf{X} = \mathbf{A}\mathbf{S}$ is given by $p_X(\mathbf{x}) = (2\pi)^{-m/2} e^{-\|\mathbf{A}^{\tau}\mathbf{x}\|^2/2}\,|\det(\mathbf{A}^{\tau})|$. But $\mathbf{A}$ is orthogonal,
and so $\|\mathbf{A}^{\tau}\mathbf{x}\|^2 = \|\mathbf{x}\|^2$ and $|\det(\mathbf{A}^{\tau})| = 1$. Thus, the density of X reduces to
$p_X(\mathbf{x}) = (2\pi)^{-m/2} e^{-\|\mathbf{x}\|^2/2}$, which is identical to the density of S, so that the orthogonal mixing matrix
A cannot be identified for independent Gaussian sources. Thus, it makes sense to require
that at most one of the independent source components be Gaussian distributed.
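The identifiability argument above can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it applies the same orthogonal mixing (a 45-degree rotation, our choice) to independent Gaussian sources and to independent uniform sources, and compares the excess kurtosis of the first component before and after mixing. The helper `excess_kurtosis`, the seed, and the sample size are assumptions made for illustration.

```python
import numpy as np

def excess_kurtosis(y):
    """Sample fourth cumulant of y after standardizing to zero mean and unit variance."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

rng = np.random.default_rng(1)
n = 200_000
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # orthogonal mixing matrix

S_gauss = rng.standard_normal((n, 2))                          # Gaussian sources
S_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))     # uniform sources, unit variance

X_gauss = S_gauss @ A.T     # rows are X_i = A S_i
X_unif = S_unif @ A.T

k_gauss_before = excess_kurtosis(S_gauss[:, 0])
k_gauss_after = excess_kurtosis(X_gauss[:, 0])
k_unif_before = excess_kurtosis(S_unif[:, 0])
k_unif_after = excess_kurtosis(X_unif[:, 0])
```

For the Gaussian sources the rotation leaves the marginal distribution unchanged, so both kurtosis values stay near zero and nothing in the data distinguishes A from the identity. For the uniform sources the kurtosis moves from about -1.2 toward zero (about -0.6 for this rotation), so the mixing leaves a detectable trace.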
There are a number of ways of estimating this type of ICA model while ensuring that the
components of S are as statistically independent and non-Gaussian as possible. Usually, we
are in possession of n repeated r-variate observations, $\mathbf{X}_i = (X_{i1}, \ldots, X_{ir})^{\tau}$, i = 1, 2, . . . , n,
on X, which constitute our data set. From this, our goal is to recover the m independent
sources, $\mathbf{S}_i = (S_{i1}, \ldots, S_{im})^{\tau}$, i = 1, 2, . . . , n, which generated the data through $\mathbf{X}_i = \mathbf{A}\mathbf{S}_i$,
i = 1, 2, . . . , n. Several efficient computational algorithms have been created to reach this
goal.
In most ICA applications, X is regarded as an r-vector-valued stochastic process
$\mathbf{X}(t) = (X_1(t), \ldots, X_r(t))^{\tau}$ (e.g., audio or music signals, EEG tracings, seismic recordings), where
t is a time or index parameter. We usually assume that X(t) is an unknown non-Gaussian
process with zero mean. In the linear noiseless ICA model with temporally-structured sources
and static mixing, the model is written as $\mathbf{X}(t) = \mathbf{A}\mathbf{S}(t)$, where $\mathbf{S}(t) = (S_1(t), \ldots, S_m(t))^{\tau}$
is assumed to be an m-vector of stationary sources with A static (i.e., instantaneous,
non-time-varying, without trends or delays), 1 ≤ t ≤ n. For example, in the cocktail-party
problem, $S_i(t)$ is the t-th sound spoken by the i-th speaker (i = 1, 2, . . . , m) and $X_j(t)$ is the
t-th acoustic recording made by the j-th microphone (j = 1, 2, . . . , r). In this formulation, the
ICA problem is closely related to the deconvolution of time series; see, for example, Donoho
(1981), who discusses at length the single-channel deconvolution problem and its application
to seismology. Extensions to the multi-channel case have also been studied.
If the mixing matrix A = A(t) is allowed to depend upon the time parameter, then we
refer to the model as dynamic mixing. By incorporating the temporal structure of the sources
into the ICA model, there is a good chance that the separation properties of the analysis
can be improved. In our description of ICA models, we omit the explicit dependence of X
on t unless specifically needed in the exposition.
4. LINEAR MIXING: I. NOISELESS ICA
4.1 The Model
The simplest form of the ICA model is the linear mixing version with no additive noise,
usually called the noiseless (or classical) ICA model. In this scenario, X is modelled as
$$ \mathbf{X} = \mathbf{A}\mathbf{S}, \tag{3} $$
where the source components $\{S_j\}$ are assumed to be statistically independent and $\mathbf{A}$ is a
full-rank (r×m) mixing matrix with unknown coefficients. Usually, m ≤ r. For model (3),
where the sources have mean zero, X has mean zero and covariance matrix $\Sigma_{XX} = \mathbf{A}\mathbf{A}^{\tau}$.
The BSS (and ICA) problem for model (3) is to estimate A and recover S. Note that the
model (3) does not identify A and S uniquely, for if $\mathbf{S}^* = \mathbf{T}^{\tau}\mathbf{S}$ and $\mathbf{A}^* = \mathbf{A}\mathbf{T}$, where $\mathbf{T}$ is
an orthogonal (m×m)-matrix, then $\mathbf{X}^* = \mathbf{A}^*\mathbf{S}^*$ has unchanged mean and covariance matrix
($\Sigma_{X^*X^*} = \mathbf{A}^*\mathbf{A}^{*\tau} = \mathbf{A}\mathbf{A}^{\tau} = \Sigma_{XX}$).
If the number of sources is unknown, it is generally assumed that m < r. In situations
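where A is not square but of full-rank, there exists an inverse mapping $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_m)^{\tau}$,
usually termed a separating or unmixing matrix, such that
$$ \mathbf{Y} = \mathbf{W}\mathbf{X} = (\mathbf{w}_1^{\tau}\mathbf{X}, \ldots, \mathbf{w}_m^{\tau}\mathbf{X})^{\tau} = (Y_1, \ldots, Y_m)^{\tau} \tag{4} $$
approximates the source component vector S. Our goal is to determine W and, hence, Y. If
A were known, then the solution would be given by $\mathbf{Y} = (\mathbf{A}^{\tau}\mathbf{A})^{-1}\mathbf{A}^{\tau}\mathbf{X}$.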
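As a sanity check on the known-A solution just stated, the sketch below simulates the noiseless model with more measurements than sources (r = 3, m = 2) and recovers the sources with the left pseudo-inverse of A. The Laplacian sources, the random mixing matrix, and all variable names are hypothetical choices for illustration; in a real ICA problem A is unknown and W must be estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 1000, 2, 3

S = rng.laplace(size=(n, m))         # hypothetical non-Gaussian sources, one row per observation
A = rng.standard_normal((r, m))      # hypothetical full-rank (r x m) mixing matrix
X = S @ A.T                          # rows are X_i = A S_i (noiseless linear mixing)

# Unmixing matrix W = (A^T A)^{-1} A^T, computed by solving (A^T A) W = A^T
W = np.linalg.solve(A.T @ A, A.T)    # shape (m, r)
Y = X @ W.T                          # rows are Y_i = W X_i
```

Because W A is the (m×m) identity matrix, `Y` reproduces `S` up to rounding error. The example only shows that the separation step is a linear map once the mixing is known.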
An important special case of model (3) is the square mixing model, where the number of
independent sources is equal to the number of measurements (i.e., m = r), a simplification
studied by Bell and Sejnowski (1995). As we saw above, if X has been centered and sphered,
then the resulting square mixing matrix A in model (3) is orthogonal. In this case, the
number of elements of A to be determined is reduced from $r^2$ to r(r − 1)/2. The goal
is to determine an orthogonal A and recover S using $\mathbf{Y} = \mathbf{W}\mathbf{X}$, where $\mathbf{W} = \mathbf{A}^{\tau}$, and the
elements, $Y_1 = \mathbf{w}_1^{\tau}\mathbf{X}, \ldots, Y_m = \mathbf{w}_m^{\tau}\mathbf{X}$, of Y are taken to be independent and as non-Gaussian
as possible.
4.2 Objective Functions
The general strategy behind ICA is to set up an appropriate objective (or contrast) func-
tion (also called a projection index in PP) to judge the merit of a particular m-dimensional
projection of multivariate data, and then use an optimization algorithm to find the global
and local maxima of that objective function over all such m-dimensional projections of the
data. For a given m = 1, 2, or 3, the optimization step determines the most informative
m-dimensional projection of the data. For numerical optimization purposes, we want the
objective function to possess certain desirable computational and analytical properties. The
most desirable property is that of affine invariance (location and scale invariance); exam-
ples of affine invariant objective functions include absolute cumulants, standardized Fisher
information, and relative entropy.
4.3 Polynomial-Based Indexes
4.3.1 One-Dimensional Indexes. First, we assume that m = 1, so that $Y = \mathbf{w}^{\tau}\mathbf{X}$ is a
single continuous random variable having probability density function $q_Y(y)$. The projection
indexes which drive PP can also be used as objective functions for ICA. These indexes take
the general form of weighted versions of integrated squared error,
$$ I(Y) = \int [\phi(y) - q_Y(y)]^2\, w(y)\, dy, \tag{5} $$
where w(y) is a given weight function on $\mathbb{R}$. The index I(Y) measures the extent of departure
of the density $q_Y(y)$ from the standard Gaussian density, $\phi(y) = (2\pi)^{-1/2} e^{-y^2/2}$, having zero
mean and unit variance.
An index such as (5) can be expressed in terms of the coefficients of orthogonal polynomial
expansions of the density function $q_Y(y)$. If $q_Y(y)$ is a (square-integrable) density
function, then it can be represented as a convergent orthogonal series expansion,
$q_Y(y) = \sum_{k=0}^{\infty} \alpha_k P_k(y)$, $y \in \mathbb{R}$, where $\{P_k\}$ is a complete orthonormal system of functions on the real
line $\mathbb{R}$ (or some subset thereof) (i.e., $\int P_i(y)P_j(y)\,dy = \delta_{ij}$, the Kronecker delta), $P_k$ is a
polynomial of degree k, and the $\{\alpha_k\}$ are coefficients defined by $\alpha_k = E_q\{P_k(Y)\}$. There are
many different types of $\{P_k\}$; see, e.g., Abramowitz and Stegun (1972, Chapter 22). For our
purposes, we need only mention two versions of Hermite polynomials:
• Chebyshev-Hermite polynomials: $\mathrm{He}_k(y) = (-1)^k e^{y^2/2} D^k e^{-y^2/2}$, k = 0, 1, 2, . . .. The
$\{\mathrm{He}_k(y)\}$ form a complete orthogonal basis on $\mathbb{R}$ with respect to the weight function
$\phi(y)$ in the sense that
$$ \int \phi(y)\,\mathrm{He}_i(y)\,\mathrm{He}_j(y)\,dy = j!\,\delta_{ij}. \tag{6} $$
In this case, $P_k(y) = (k!)^{-1/2}\,\mathrm{He}_k(y)\,[\phi(y)]^{1/2}$, k = 0, 1, 2, . . .. The first few Chebyshev-Hermite
polynomials are given by $\mathrm{He}_0(y) = 1$, $\mathrm{He}_1(y) = y$, $\mathrm{He}_2(y) = y^2 - 1$, $\mathrm{He}_3(y) = y^3 - 3y$,
and $\mathrm{He}_4(y) = y^4 - 6y^2 + 3$.
• Hermite polynomials: $H_k(y) = (-1)^k e^{y^2} D^k e^{-y^2}$, k = 0, 1, 2, . . .. The $\{H_k(y)\}$ form a
complete orthogonal basis on $\mathbb{R}$ with respect to the weight function $[\phi(y)]^2$ in the
sense that
$$ \int [\phi(y)]^2 H_i(y) H_j(y)\,dy = \delta_{ij}\, 2^{j-1} j!\, \pi^{-1/2}. \tag{7} $$
In this case, $P_k(y) = (2^{k-1} k!\, \pi^{-1/2})^{-1/2} H_k(y)\,\phi(y)$, k = 0, 1, 2, . . .. The first few Hermite
polynomials are given by $H_0(y) = 1$, $H_1(y) = 2y$, $H_2(y) = 4y^2 - 2$, $H_3(y) = 8y^3 - 12y$,
and $H_4(y) = 16y^4 - 48y^2 + 12$.
The symbol $D^k$ represents the derivative $d^k/dy^k$ of whatever immediately follows.
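The two orthogonality relations, (6) for the Chebyshev-Hermite polynomials and (7) for the Hermite polynomials, can be checked numerically. The sketch below assumes NumPy's `numpy.polynomial.hermite_e` module (the "probabilists'" family, matching He_k) and `numpy.polynomial.hermite` module (the "physicists'" family, matching H_k), and uses Gauss quadrature with 20 nodes. The helper names `He`, `H`, `inner_He`, and `inner_H` are ours.

```python
import math
import numpy as np
from numpy.polynomial import hermite, hermite_e

# Gauss quadrature nodes and weights for the two weight functions
x_e, w_e = hermite_e.hermegauss(20)    # weight exp(-y^2 / 2)
x_h, w_h = hermite.hermgauss(20)       # weight exp(-y^2)

def He(k, y):
    """Chebyshev-Hermite polynomial He_k evaluated at y."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0                      # select the k-th basis polynomial
    return hermite_e.hermeval(y, coef)

def H(k, y):
    """Hermite polynomial H_k evaluated at y."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return hermite.hermval(y, coef)

def inner_He(i, j):
    """Integral of phi(y) He_i(y) He_j(y) dy, with phi = exp(-y^2/2) / sqrt(2 pi)."""
    return np.sum(w_e * He(i, x_e) * He(j, x_e)) / math.sqrt(2.0 * math.pi)

def inner_H(i, j):
    """Integral of phi(y)^2 H_i(y) H_j(y) dy, with phi^2 = exp(-y^2) / (2 pi)."""
    return np.sum(w_h * H(i, x_h) * H(j, x_h)) / (2.0 * math.pi)
```

With these helpers, `inner_He(3, 3)` should agree with 3! = 6, `inner_H(3, 3)` with 2^2 · 3! / sqrt(pi), and any pair with i ≠ j should integrate to zero up to rounding error.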
In devising a projection index, Friedman (1987) noted that Y is standard Gaussian with
density $\phi(y)$ if and only if $U = 2\Phi(Y) - 1$, where $\Phi(Y) = \int_{-\infty}^{Y} \phi(y)\,dy$, is uniformly distributed
on the interval [−1, 1]. Thus, the density of U, $q_U(u)$, say, could be compared to the uniform
density using integrated squared error (ISE),
$$ I_F(Y) = \int_{-1}^{1} \left[q_U(u) - \tfrac{1}{2}\right]^2 du = \int_{-1}^{1} [q_U(u)]^2\,du - \tfrac{1}{2}. \tag{8} $$
The further $q_U(u)$ is from the uniform density, the further Y would be from Gaussianity, and
so $I_F(Y)$ would measure the extent of non-Gaussianity. Friedman approximated
$I_F$ by expanding $q_U(u)$ in (8) as a truncated sum of Legendre polynomials, where the number
of terms in the truncated expansion determines how much smoothing is allowed by the
approximation. A bivariate version of PP using an extension of the objective function (8) was
also derived and is publicly available from StatLib as the FORTRAN subroutine ppdeaux.
Hall (1989) (and later Cook, Buja, and Cabrera, 1993) showed that if Friedman’s index
(8) is transformed back to the original scale, it can be reexpressed as
$$ I_F(Y) = \frac{1}{2}\int [\phi(y) - q_Y(y)]^2\, \frac{1}{\phi(y)}\,dy, \tag{9} $$
where $q_Y(y)/[\phi(y)]^{1/2}$ is assumed to be square-integrable. Based on the form of (9), Hall
noted that unless the tails of $q_Y(y)$ decrease fast enough, $I_F(Y)$ can be infinite; thus, for
heavy-tailed $q_Y(y)$, $I_F(Y)$ will not be very useful as a measure of departure from Gaussianity.
Friedman, however, specifically used the index $I_F(Y)$ to search for “projected distributions
that exhibit clustering (multimodality) or other kinds of nonlinear associations,” rather than
use it to identify heavy-tailed departures from Gaussianity.
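The step from the uniform scale back to the original scale is a change of variables, which may be worth writing out. With $u = 2\Phi(y) - 1$ we have $du = 2\phi(y)\,dy$ and $q_U(u) = q_Y(y)/(2\phi(y))$, so that, as a sketch of the calculation,

```latex
\begin{aligned}
I_F(Y) &= \int_{-1}^{1}\left[q_U(u) - \tfrac{1}{2}\right]^2 du
        = \int_{-\infty}^{\infty}\left[\frac{q_Y(y)}{2\phi(y)} - \frac{1}{2}\right]^2 2\phi(y)\,dy \\
       &= \frac{1}{2}\int_{-\infty}^{\infty}\frac{[\phi(y) - q_Y(y)]^2}{\phi(y)}\,dy .
\end{aligned}
```

The factor $1/\phi(y)$ in the last line is what inflates the contribution of the tails, which is the source of Hall's observation about heavy-tailed densities.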
The Gram-Charlier expansion of $q_Y(y)$ is given by
$$ q_Y(y) = \phi(y)\sum_{k=0}^{\infty} \frac{a_k}{k!}\,\mathrm{He}_k(y), \tag{10} $$
where $a_k = E_q\{\mathrm{He}_k(Y)\}$ and $\mathrm{He}_k(y)$ is the Chebyshev-Hermite polynomial of order k
(Thisted, 1988, p. 285). Substitute (10) for $q_Y(y)$ in (9), then expand the squared term
in the integrand and use the orthogonality condition (6). From the definition of $a_k$, and
because Y has zero mean and unit variance, it follows that $a_0 = 1$, $a_1 = 0$, and $a_2 = 0$.
Thus,
$$ I_F(Y) = \frac{1}{2}\sum_{k=3}^{\infty} \frac{a_k^2}{k!}. \tag{11} $$
The index $I_F(Y)$ can be approximated by truncating the sum at k = K,
$$ I_F^K(Y) = \frac{1}{2}\sum_{k=3}^{K} \frac{a_k^2}{k!}. \tag{12} $$
Given i.i.d. observations, $Y_1, \ldots, Y_n$, on Y, we can estimate the $\{a_k\}$ by the sample averages,
$$ \hat{a}_k = n^{-1}\sum_{i=1}^{n} \mathrm{He}_k(Y_i), \quad k = 3, 4, \ldots, K, \tag{13} $$
and then substitute (13) into (12) to get the estimated index $\hat{I}_F^K(Y)$.
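The estimate obtained from (13) and (12) is straightforward to compute. The sketch below is one possible implementation, assuming NumPy's `numpy.polynomial.hermite_e.hermeval` for evaluating the Chebyshev-Hermite polynomials. The function name `friedman_index_truncated`, the standardization of the sample inside the function, and the two test distributions are our assumptions.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e

def friedman_index_truncated(y, K=4):
    """Estimate (1/2) * sum_{k=3}^{K} a_k^2 / k!, with a_k the sample mean of He_k(Y_i)."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()           # enforce zero mean and unit variance
    total = 0.0
    for k in range(3, K + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0                       # selects He_k
        a_k = hermite_e.hermeval(y, coef).mean()
        total += a_k ** 2 / math.factorial(k)
    return 0.5 * total

rng = np.random.default_rng(3)
index_gauss = friedman_index_truncated(rng.standard_normal(100_000))
index_expon = friedman_index_truncated(rng.exponential(size=100_000))
```

For a Gaussian sample the estimated index is close to zero, while for the strongly skewed exponential sample it is of order one, reflecting the large third and fourth Hermite coefficients.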
The index $I_F^K(Y)$ can also be expressed in terms of the cumulants of Y. If Y has
zero mean, then the first four cumulants of Y are given by: $\kappa_1 = 0$, $\kappa_2 = E(Y^2)$, $\kappa_3 = E(Y^3)$,
$\kappa_4 = E(Y^4) - 3[E(Y^2)]^2$. It follows that $a_3 = \kappa_3 = \kappa_3(Y)$ is the skewness of Y
and $a_4 = \kappa_4 = \kappa_4(Y)$ is the kurtosis of Y. If the density of Y is symmetric, then $\kappa_3 = 0$;
a nonzero $\kappa_3$ indicates asymmetry. The fourth cumulant, $\kappa_4$, measures the flatness vs. peakedness of the density
of Y. A zero-mean Gaussian Y has $\kappa_3 = \kappa_4 = 0$. Any Y with $\kappa_4 = 0$ is called mesokurtic,
but examples of such densities (other than the Gaussian) are rare. If $\kappa_4 > 0$, we say that
Y is super-Gaussian (or leptokurtic or approximately sparse) with a density which is highly
peaked at 0 and has heavier tails than the Gaussian (e.g., Laplacian or double-exponential
density), while if $\kappa_4 < 0$, then Y is called sub-Gaussian (or platykurtic) with a density which
may be flat (or multimodal) over much of the range of Y and have very small values at the
extremes (e.g., uniform density). Setting K = 4, and estimating $\kappa_3$ and $\kappa_4$ by the sample
estimates $\hat{\kappa}_3 = \hat{\kappa}_3(Y)$ and $\hat{\kappa}_4 = \hat{\kappa}_4(Y)$, respectively, (12) can be estimated by
$$ \hat{I}_F^4(Y) = \frac{\hat{\kappa}_3^2(Y)}{12} + \frac{\hat{\kappa}_4^2(Y)}{48}, \tag{14} $$
which is the moment-based projection index of Jones and Sibson (1987).
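To make the super-Gaussian and sub-Gaussian terminology concrete, the sketch below computes the sample third and fourth cumulants of standardized data and combines their squares with the weights 1/12 and 1/48 that follow from (12) with K = 4. The function name `moment_index`, the seed, and the choice of Laplacian, uniform, and Gaussian samples are assumptions made for illustration.

```python
import numpy as np

def moment_index(y):
    """Return (index, k3, k4): squared-cumulant index and the sample cumulants.

    The sample is standardized first, so k3 and k4 are the sample skewness
    and excess kurtosis.
    """
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()
    k3 = np.mean(y ** 3)
    k4 = np.mean(y ** 4) - 3.0
    return k3 ** 2 / 12.0 + k4 ** 2 / 48.0, k3, k4

rng = np.random.default_rng(4)
n = 200_000
idx_laplace, _, k4_laplace = moment_index(rng.laplace(size=n))          # super-Gaussian
idx_uniform, _, k4_uniform = moment_index(rng.uniform(-1, 1, size=n))   # sub-Gaussian
idx_gauss, _, k4_gauss = moment_index(rng.standard_normal(n))           # reference case
```

The Laplacian sample gives a clearly positive fourth cumulant (its population value is 3), the uniform sample a negative one (population value -1.2), and the Gaussian sample a value near zero. Because the index squares the cumulants, it is positive in both non-Gaussian cases.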
In practice, although the index (14) can be computed very quickly, the skewness and
kurtosis components are primarily influenced by tail structure (i.e., outliers) in the data. The
ironic feature of Friedman’s proposed index IF (Y ) is that rather than force attention away
from the tails as intended, it turns out to do exactly the opposite. Interestingly enough, it
turns out that outliers in the projected data are not at all unusual. In simulation experiments
using a moment-based index similar to (14) for PP (see Friedman and Johnstone’s discussions
of Jones and Sibson, 1987), outliers were observed to appear repeatedly in projections of even
well-behaved multivariate Gaussian data. Furthermore, there is no obvious way to robustify
the index (14).
Given the potential robustness problems inherent in using (14) as a projection index,
Hall (1989) proposed a variation on the theme of $I_F$ by studying the integrated squared
error between $q_Y(y)$ and the Gaussian density $\phi(y)$,
$$ I_H(Y) = \int [\phi(y) - q_Y(y)]^2\,dy. \tag{15} $$
This index can also be expressed in terms of the coefficients of certain orthogonal functions.
Expanding $q_Y(y)$ in terms of the Hermite polynomials $\{H_k(y)\}$ yields
$$ q_Y(y) = \phi(y)\sum_{k=0}^{\infty} b_k\, \gamma_k^{-1/2} H_k(y), \tag{16} $$
where $b_k = b_k(Y) = \gamma_k^{-1/2} E_q\{H_k(Y)\phi(Y)\}$ and $\gamma_k = 2^{k-1} k!\, \pi^{-1/2}$. Substituting (16) into
(15), expanding the squared integrand, and then simplifying using (7) yields
$$ I_H(Y) = (b_0(Y) - \gamma_0^{1/2})^2 + \sum_{k=1}^{\infty} [b_k(Y)]^2, \tag{17} $$
where $\gamma_0 = (2\pi^{1/2})^{-1} \approx 0.282$.
An interesting result is obtained if we truncate (17) at k = 0. Then,
$$ I_H^0(Y) = (b_0(Y) - \gamma_0^{1/2})^2 = \gamma_0^{-1}\left(E_q\{\phi(Y)\} - E\{\phi(Z)\}\right)^2, \tag{18} $$
where Z is standard Gaussian and, using (7), $E\{\phi(Z)\} = (2\pi^{1/2})^{-1}$. We shall see a generalized
form of this objective function (18) again in Section 4.4.2 (see (51)).
Given i.i.d. observations, $Y_1, \ldots, Y_n$, on Y, the $\{b_k\}$ can be estimated by the sample
averages,
$$ \hat{b}_k = \hat{b}_k(Y) = \gamma_k^{-1/2}\, n^{-1}\sum_{i=1}^{n} H_k(Y_i)\phi(Y_i), \quad k = 0, 1, 2, \ldots. \tag{19} $$
Substituting (19) into (17) and truncating the sum at k = K yields the estimate
$$ \hat{I}_H^K(Y) = (\hat{b}_0(Y) - \gamma_0^{1/2})^2 + \sum_{k=1}^{K} [\hat{b}_k(Y)]^2. \tag{20} $$
Under certain regularity conditions, Hall showed that $\hat{I}_H^K(Y)$ is a useful measure of departure
from Gaussianity, with the most interesting projection direction maximizing $\hat{I}_H^K(Y)$.
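Hall's estimated index can be sketched in the same style, with sample averages of $H_k(Y_i)\phi(Y_i)$ scaled by $\gamma_k^{-1/2}$ and combined as in (17). The code below assumes NumPy's `numpy.polynomial.hermite.hermval` for the Hermite polynomials $H_k$. The function names, the default K = 4, and the standardization of the sample are our choices, not prescriptions from the text.

```python
import math
import numpy as np
from numpy.polynomial import hermite

def phi(y):
    """Standard Gaussian density."""
    return np.exp(-0.5 * y ** 2) / math.sqrt(2.0 * math.pi)

def gamma(k):
    """Normalizing constant gamma_k = 2^(k-1) k! pi^(-1/2)."""
    return 2.0 ** (k - 1) * math.factorial(k) / math.sqrt(math.pi)

def hall_index(y, K=4):
    """Truncated estimate: (b_0 - sqrt(gamma_0))^2 + sum_{k=1}^{K} b_k^2."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()               # zero mean, unit variance
    total = 0.0
    for k in range(K + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0                           # selects H_k
        b_k = (hermite.hermval(y, coef) * phi(y)).mean() / math.sqrt(gamma(k))
        if k == 0:
            total += (b_k - math.sqrt(gamma(0))) ** 2
        else:
            total += b_k ** 2
    return total

rng = np.random.default_rng(5)
hall_gauss = hall_index(rng.standard_normal(100_000))
hall_uniform = hall_index(rng.uniform(-1, 1, size=100_000))
```

Every term is a sample average of a bounded function of Y (a polynomial times the Gaussian density), so a single extreme observation has limited influence, in contrast with the raw third and fourth moments.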
A further modification of Friedman’s and Hall’s projection indexes was proposed
by Cook, Buja, and Cabrera (1993),
$$ I_{CBC}(Y) = \int [\phi(y) - q_Y(y)]^2\,\phi(y)\,dy, \tag{21} $$
who put more weight around the center of the distribution, rather than at the tails. This
time we apply the Chebyshev-Hermite expansion to both $q_Y(y)$ and $\phi(y)$. We can write
$$ q_Y(y) = \sum_{k=0}^{\infty} \frac{c_k}{k!}\,\mathrm{He}_k(y), \qquad \phi(y) = \sum_{k=0}^{\infty} \frac{d_k}{k!}\,\mathrm{He}_k(y), \tag{22} $$
where the coefficients $\{c_k\}$ are defined by $c_k = c_k(Y) = E_q\{\mathrm{He}_k(Y)\phi(Y)\}$, the coefficients
$\{d_k\}$ are given by $d_{2m} = (-1)^m \sqrt{(2m)!}\,/\,(m!\, 2^{2m+1}\sqrt{\pi})$ and $d_{2m+1} = 0$, m = 0, 1, 2, . . ., and
$\mathrm{He}_k(y)$ is the Chebyshev-Hermite polynomial of order k. Substituting (22) into (21), expanding
the squared integrand and using the orthogonality condition (6), $I_{CBC}(Y)$ can be
written as
$$ I_{CBC}(Y) = \sum_{k=0}^{\infty} \frac{(d_k - c_k(Y))^2}{k!}. \tag{23} $$
It is not difficult to show that if we truncate (23) to the first term (k = 0), then $I_{CBC}^0$ can
be expressed as
$$ I_{CBC}^0(Y) = (d_0 - c_0(Y))^2 = \left(E_q\{\phi(Y)\} - E\{\phi(Z)\}\right)^2, \tag{24} $$
which is proportional to (18).
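The proportionality can be made explicit. Since $\mathrm{He}_0(y) = 1$, the leading coefficients are $d_0 = \int [\phi(y)]^2\,dy = E\{\phi(Z)\} = (2\pi^{1/2})^{-1}$ and $c_0(Y) = E_q\{\phi(Y)\}$, so, as a short worked step,

```latex
I_H^0(Y) \;=\; \gamma_0^{-1}\left(E_q\{\phi(Y)\} - E\{\phi(Z)\}\right)^2
        \;=\; \gamma_0^{-1}\, I_{CBC}^0(Y)
        \;=\; 2\sqrt{\pi}\; I_{CBC}^0(Y) \;\approx\; 3.545\, I_{CBC}^0(Y).
```

The two zeroth-order indexes therefore rank projection directions identically.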
Given i.i.d. observations, $Y_1, \ldots, Y_n$, on Y, we can estimate the unknown $\{c_k\}$ by the
sample averages,
$$ \hat{c}_k = \hat{c}_k(Y) = n^{-1}\sum_{i=1}^{n} \mathrm{He}_k(Y_i)\phi(Y_i). \tag{25} $$
The index $I_{CBC}(Y)$ is then estimated by substituting (25) for $c_k$ in (23), truncating the sum
at k = K, and setting
$$ \hat{I}_{CBC}(Y) = \sum_{k=0}^{K} \frac{(d_k - \hat{c}_k(Y))^2}{k!}. \tag{26} $$
Some attention has been focussed upon an appropriate choice of K, which also serves as a
smoothing parameter. Cook et al. (1993) surprisingly found that small values of K (K = 0
or K = 1) turned out to be the most interesting, especially in discovering projections with
a “hole” in the middle, or skewness when it exists.
4.3.2 Two-Dimensional Indexes. Next, we can obtain a projection index for m = 2 by
using the same ideas as for a one-dimensional index. Let $(Y_1, Y_2)$ be a bivariate projection
of X, where
$$ Y_1 = \mathbf{w}_1^{\tau}\mathbf{X}, \quad Y_2 = \mathbf{w}_2^{\tau}\mathbf{X}, \tag{27} $$
4.4 Relative Entropy
The entropy of a random variable was introduced by Claude E. Shannon in 1948 and
has since become a valuable concept in information theory. See, for example, Gray (1990),
Cover and Thomas (1991). The entropy of the random variable Y gives us a notion of
how much information is contained in Y . Essentially, entropy is largest when Y is most
unpredictable. If Y is a continuous random variable with probability density function $q_Y(y)$,
then the (differential) entropy H(Y) of Y is defined by
$$ H(Y) = -\int q_Y(y) \log q_Y(y)\,dy. \tag{28} $$
Amongst all random variables having equal variance, the largest value of H(Y ) occurs when
Y has a Gaussian distribution. Small values of H(Y ) occur when the distribution of Y is
concentrated on specific values. Jones and Sibson (1987) had the idea to use the concept of
entropy as a measure of non-Gaussianity.
If we normalize H(Y ) so that it has the value zero for a Gaussian variable and otherwise
is always nonnegative, we arrive at relative entropy (also called negentropy) defined by
$$\mathcal{J}(Y) = H(Z) - H(Y), \tag{29}$$
where Z is a Gaussian random variable having the same variance as Y (Cover and Thomas,
1991). If Z has mean 0 and variance 1, then,
$$H(Z) = \tfrac{1}{2}[1 + \log 2\pi] \approx 1.419. \tag{30}$$
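The value in (30), and the claim that the Gaussian has the largest entropy for a given variance, can be checked numerically. The sketch below evaluates (28) by a simple Riemann sum on a grid; the helper names and the choice of a unit-variance uniform density as the comparison are illustrative assumptions.

```python
import numpy as np


def gaussian_entropy(sigma=1.0):
    """Closed form of H(Z) for Z ~ N(0, sigma^2); sigma = 1 gives (30)."""
    return 0.5 * (1.0 + np.log(2.0 * np.pi * sigma**2))


def entropy_on_grid(density, lo, hi, n=200001):
    """Riemann-sum approximation to the differential entropy (28)."""
    y = np.linspace(lo, hi, n)
    q = density(y)
    integrand = np.zeros_like(q)
    positive = q > 0
    integrand[positive] = -q[positive] * np.log(q[positive])
    return float(np.sum(integrand) * (y[1] - y[0]))


def std_normal(y):
    return np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)


def unit_variance_uniform(y):
    half_width = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
    return np.where(np.abs(y) <= half_width, 1.0 / (2.0 * half_width), 0.0)


h_gauss = entropy_on_grid(std_normal, -10.0, 10.0)
h_unif = entropy_on_grid(unit_variance_uniform, -np.sqrt(3.0), np.sqrt(3.0))
negentropy_unif = gaussian_entropy() - h_unif  # relative entropy (29)
```

The uniform density has entropy log(2√3) ≈ 1.242, so its relative entropy (29) is positive, about 0.18.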
An important property of relative entropy (but not of differential entropy) is that it is
invariant under linear invertible transformations (Comon, 1994): if Y is an m-vector with
mean 0 and covariance matrix Σ, and if X is an r-vector such that X = AY, then J (X) =
J (Y).
Differential entropy turns out to be difficult to compute, due mainly to the fact that the
probability density function qY (y) is, in principle, unknown. Attempts have been made to
estimate functionals of a density, and especially entropy, using a nonparametric estimate of
qY (y), which then gets used as a “plug-in” estimator (Izenman, 1991), but such computations
can be notoriously slow. More efficient approximations to J (Y ) involve either higher-order
cumulants or nonpolynomial expansions of the density function qY (y) in (28).
4.4.1 Polynomial Approximation. From (10), the Gram-Charlier expansion of the density $q_Y(y)$ can be written as
$$q_Y(y) = \phi(y)(1 + \varepsilon(y)), \tag{31}$$
where
$$\varepsilon(y) = \sum_{k=3}^{\infty} \frac{a_k}{k!} He_k(y). \tag{32}$$
Assuming that $q_Y(y) \approx \phi(y)$, then, expanding $\log(1 + \varepsilon)$ in a Taylor series,
$$\log q_Y(y) = \log\phi(y) + \log(1 + \varepsilon(y)) = \log\phi(y) + \varepsilon(y) - \tfrac{1}{2}[\varepsilon(y)]^2 + O([\varepsilon(y)]^3). \tag{33}$$
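Substituting (31) into (28), while using (33) and (6), we have
$$
\begin{aligned}
H(Y) &= -\int \phi(y)(1 + \varepsilon(y))\left(\log\phi(y) + \varepsilon(y) - \tfrac{1}{2}[\varepsilon(y)]^2 + O([\varepsilon(y)]^3)\right)dy \\
&= -\int \phi(y)\left(1 + \sum_{k=3}^{\infty}\frac{a_k}{k!}He_k(y)\right) \times \left[\log\phi(y) + \sum_{k=3}^{\infty}\frac{a_k}{k!}He_k(y) - \frac{1}{2}\left(\sum_{k=3}^{\infty}\frac{a_k}{k!}He_k(y)\right)^2 + O([\varepsilon(y)]^3)\right]dy \\
&= H(Z) - \frac{1}{2}\sum_{k=3}^{\infty}\frac{a_k^2}{k!} + O([\varepsilon(y)]^3).
\end{aligned} \tag{34}
$$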
If we truncate the series in (34) at k = 4, then we have the result that
$$\mathcal{J}(Y) \approx \frac{\kappa_3^2(Y)}{12} + \frac{\kappa_4^2(Y)}{48}, \tag{35}$$
which again (see (14)) is the moment-based projection index of Jones and Sibson (1987).
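A sample version of (35) is easy to write down. The sketch below assumes that $\kappa_3$ and $\kappa_4$ are the third and fourth cumulants of the standardized variable (skewness and excess kurtosis) and estimates them by sample moments; the function name is hypothetical.

```python
import numpy as np


def moment_negentropy(y):
    """Moment-based approximation (35) to relative entropy."""
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()      # standardize to mean 0, variance 1
    kappa3 = np.mean(z**3)            # third cumulant (skewness)
    kappa4 = np.mean(z**4) - 3.0      # fourth cumulant (excess kurtosis)
    return kappa3**2 / 12.0 + kappa4**2 / 48.0
```

For a Gaussian sample the value is close to zero, while a Laplace sample (excess kurtosis 3) gives roughly 9/48 ≈ 0.19. Because the estimate depends on third and fourth powers of the data, a few outliers can change it substantially.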
4.4.2 Nonpolynomial Approximation. To overcome the data-sensitivity of the moment-
based index (35), Hyvarinen (1998) used instead a nonpolynomial function to maximize the
entropy H(Y ) of Y . Suppose Gi(Y ), i = 1, 2, . . . , N , are different nonpolynomial functions
of Y which (like the Hermite polynomials) form an orthonormal system with respect to the
standard Gaussian density φ,
$$\int \phi(y)G_i(y)G_j(y)\,dy = \delta_{ij}, \tag{36}$$
and which also are orthogonal to all polynomials of up to second order,
$$\int \phi(y)G_i(y)y^k\,dy = 0, \quad k = 0, 1, 2. \tag{37}$$
The orthogonality constraints (36) and (37) can always be satisfied by using ordinary Gram-Schmidt orthonormalization. We further assume that the expectations of the first N of the $G_i(Y)$ are given by the following values:
$$E(G_i(Y)) = \int q_Y(y)G_i(y)\,dy = c_i, \quad i = 1, 2, \ldots, N. \tag{38}$$
Assuming also that Y has mean 0 and variance 1 yields two more constraints,
$$G_{N+1}(y) = y, \quad c_{N+1} = 0, \tag{39}$$
$$G_{N+2}(y) = y^2, \quad c_{N+2} = 1. \tag{40}$$
It can be shown that the probability density, $q^0_Y(y)$, which satisfies the constraints (36)–(40)
and also has the largest entropy amongst all such densities is given by
$$q^0_Y(y) = A\exp\Big\{\sum_i a_iG_i(y)\Big\}, \tag{41}$$
where $A$ and the $\{a_i\}$ are constants to be determined from (38). If we again assume that
$q_Y(y) \approx \phi(y)$, then for (41) to be close to $e^{-y^2/2}$, the only substantial coefficient has to be
$a_{N+2} \approx -1/2$. We can rewrite (41) as follows:
$$
\begin{aligned}
q^0_Y(y) &= A\exp\Big\{-y^2/2 + a_{N+1}y + (a_{N+2} + 1/2)y^2 + \sum_{i=1}^{N}a_iG_i(y)\Big\} \\
&\approx \tilde{A}\,\phi(y)\Big(1 + a_{N+1}y + (a_{N+2} + 1/2)y^2 + \sum_{i=1}^{N}a_iG_i(y)\Big),
\end{aligned} \tag{42}
$$
where $\tilde{A} = (2\pi)^{1/2}A$ and where we used the approximation $e^{\varepsilon} \approx 1 + \varepsilon$. Furthermore,
$$1 = \int q^0_Y(y)\,dy = \tilde{A}[1 + (a_{N+2} + 1/2)] \tag{43}$$
$$0 = E(Y) = \int q^0_Y(y)\,y\,dy = \tilde{A}a_{N+1} \tag{44}$$
$$1 = E(Y^2) = \int q^0_Y(y)\,y^2\,dy = \tilde{A}[1 + 3(a_{N+2} + 1/2)] \tag{45}$$
$$c_i = \int q^0_Y(y)G_i(y)\,dy = \tilde{A}a_i, \quad i = 1, 2, \ldots, N. \tag{46}$$
These equations are easily solved to give $a_i = c_i$, $i = 1, 2, \ldots, N$, $a_{N+1} = 0$, $a_{N+2} = -1/2$, and
$\tilde{A} = 1$. Substituting these values into (42) yields
$$q^0_Y(y) = \phi(y)\Big(1 + \sum_{i=1}^{N}c_iG_i(y)\Big), \tag{47}$$
which is referred to as the approximative maximum entropy density. Compare this represen-
tation with that given by (21). Hence, H(Y ) can be approximated by
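$$
\begin{aligned}
H(Y) &\approx -\int q^0_Y(y)\log q^0_Y(y)\,dy \\
&= -\int \phi(y)\Big(1 + \sum_{i=1}^{N}c_iG_i(y)\Big)\Big[\log\phi(y) + \log\Big(1 + \sum_{i=1}^{N}c_iG_i(y)\Big)\Big]dy \\
&\approx -\int \phi(y)\log\phi(y)\,dy - \sum_{i=1}^{N}c_i\int \phi(y)G_i(y)\log\phi(y)\,dy \\
&\qquad - \int \phi(y)\Big(1 + \sum_{i=1}^{N}c_iG_i(y)\Big)\log\Big(1 + \sum_{i=1}^{N}c_iG_i(y)\Big)dy \\
&= H(Z) - \sum_{i=1}^{N}c_i\int \phi(y)G_i(y)\log\phi(y)\,dy - \sum_{i=1}^{N}c_i\int \phi(y)G_i(y)\,dy \\
&\qquad - \frac{1}{2}\sum_{i=1}^{N}c_i^2\int \phi(y)G_i^2(y)\,dy - o\Big(\sum_{i=1}^{N}c_i^2\int \phi(y)G_i^2(y)\,dy\Big) \\
&= H(Z) - 0 - 0 - \frac{1}{2}\sum_{i=1}^{N}c_i^2 + o\Big(\sum_{i=1}^{N}c_i^2\Big),
\end{aligned} \tag{48}
$$
where we have used the conditions (36) and (37), the expansion $(1 + \varepsilon)\log(1 + \varepsilon) = \varepsilon + \varepsilon^2/2 + o(\varepsilon^2)$ for $\varepsilon$ small, and where $Z \sim N(0, 1)$. From (48) and (29), we have that
$$\mathcal{J}(Y) \approx \frac{1}{2}\sum_{i=1}^{N}(E\{G_i(Y)\})^2. \tag{49}$$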
All that remains now is to choose the functions $\{G_i(Y)\}$.
The simplest choices of these functions have N = 1 or N = 2. Taking N = 2, first, we
can make $G_1$ an odd function ($G_1(-y) = -G_1(y)$, reflecting symmetry vs. asymmetry) and
$G_2$ an even function ($G_2(-y) = G_2(y)$, reflecting sub-Gaussian vs. super-Gaussian). One
can show that in this case the approximation (49) boils down to
$$\mathcal{J}(Y) \approx \beta_1(E\{G_1(Y)\})^2 + \beta_2(E\{G_2(Y)\} - E\{G_2(Z)\})^2, \tag{50}$$
where $\beta_1$ and $\beta_2$ are positive constants. If we take N = 1, the approximation becomes
$$\mathcal{J}(Y) \approx \beta(E\{G(Y)\} - E\{G(Z)\})^2, \quad \beta > 0, \tag{51}$$
for any nonquadratic contrast function G, where $Z \sim N(0, 1)$. Note that (51) generalizes
the objective functions (18) and (24), where G is given by the standard Gaussian density φ.
The approximation (51) to relative entropy (negentropy) is used in the R and C code
implementation (Marchini, Heaton, and Ripley, 2003) of the FastICA algorithm, where β =
1. Choices of functional form of the G function (fun) used in the approximation include:
• logcosh : $G(y) = \frac{1}{\alpha}\log\cosh(\alpha y)$, $1 \le \alpha \le 2$ (usually, alpha=1),
• exp : $G(y) = -e^{-y^2/2}$.
The logcosh function has been found to be good for most types of ICA problems, while the
exp function is probably best for highly supergaussian source components where robustness
is a serious consideration. The logcosh function has also been used successfully as a flexible
family of Bayesian prior distributions, especially for the image reconstruction of photon
emission computed tomographic data (Green, 1990; Weir and Green, 1994; Weir, 1997).
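As a rough numerical illustration of (51) with the two contrast functions listed above, the following sketch estimates E{G(Y)} by a sample average and evaluates E{G(Z)} by Gauss-Hermite quadrature. It is not the code of the R package: the function names, the quadrature step, and the default β = 1 are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def G_logcosh(y, alpha=1.0):
    """logcosh contrast, G(y) = (1/alpha) log cosh(alpha y)."""
    return np.log(np.cosh(alpha * y)) / alpha


def G_exp(y):
    """exp contrast, G(y) = -exp(-y^2 / 2)."""
    return -np.exp(-0.5 * y**2)


def gaussian_mean(G, n_nodes=100):
    """E{G(Z)} for Z ~ N(0,1), computed by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)  # weight function exp(-x^2 / 2)
    return np.sum(weights * G(nodes)) / np.sqrt(2.0 * np.pi)


def negentropy_approx(y, G, beta=1.0):
    """Approximation (51), applied to the standardized sample."""
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()
    return beta * (np.mean(G(z)) - gaussian_mean(G)) ** 2
```

Both contrasts should give values near zero for Gaussian data and clearly larger values for a supergaussian sample such as the Laplace; unlike (35), neither involves powers of the data higher than those implicit in a bounded or linearly growing G.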
4.5 The FastICA Algorithm
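4.5.1 Extracting a Single Source Component. First, we detail the case of a single
(m = 1) source component (or one-dimensional projection), $Y = \mathbf{w}^{\tau}\mathbf{X}$, where $\mathbf{w}$ is an
r-vector. Consider finding the direction $\mathbf{w}$ which maximizes the approximation (51) to relative entropy subject to the sphering constraint $E\{(\mathbf{w}^{\tau}\mathbf{X})^2\} = \|\mathbf{w}\|^2 = 1$ on the projection.
In other words, we wish to find that $\mathbf{w}$ which makes the distance between the density of
the one-dimensional projection $Y = \mathbf{w}^{\tau}\mathbf{X}$ and the Gaussian density as large as possible,
where distance is measured by relative entropy. Because the maxima of the relative entropy
$\mathcal{J}(\mathbf{w}^{\tau}\mathbf{X})$ in (51) are typically obtained at certain maxima of $E\{G(\mathbf{w}^{\tau}\mathbf{X})\}$, we set
$$F(\mathbf{w}) = E\{G(\mathbf{w}^{\tau}\mathbf{X})\} - \frac{\lambda}{2}(\|\mathbf{w}\|^2 - 1), \tag{52}$$
where λ is the Lagrangian multiplier. To maximize (52), the Newton-Raphson iterative
method (see, e.g., Thisted, 1988, Section 4.2.2) yields the iteration
$$\mathbf{w} \leftarrow \mathbf{w} - \left(\frac{\partial^2 F(\mathbf{w})}{\partial\mathbf{w}^2}\right)^{-1}\left(\frac{\partial F(\mathbf{w})}{\partial\mathbf{w}}\right). \tag{53}$$
We, thus, need to find the first and second partial derivatives of $F(\mathbf{w})$ with respect to $\mathbf{w}$.
Differentiating (52) with respect to $\mathbf{w}$ yields
$$\frac{\partial F(\mathbf{w})}{\partial\mathbf{w}} = E(\mathbf{X}g(\mathbf{w}^{\tau}\mathbf{X})) - \lambda\mathbf{w}, \tag{54}$$
where $g = G'$ is the derivative of G with respect to its argument. The stationary values of the function F are found by equating (54) to
zero. Premultiplying both sides of the resulting equation by $\mathbf{w}^{\tau}$ and using $\|\mathbf{w}\|^2 = 1$ yields
$$\lambda = E(\mathbf{w}^{\tau}\mathbf{X}g(\mathbf{w}^{\tau}\mathbf{X})). \tag{55}$$
Differentiating (54) with respect to $\mathbf{w}$ gives the approximate second derivative of F,
$$\frac{\partial^2 F(\mathbf{w})}{\partial\mathbf{w}^2} = E(\mathbf{X}\mathbf{X}^{\tau}g'(\mathbf{w}^{\tau}\mathbf{X})) - \lambda\mathbf{I}_r \approx E(\mathbf{X}\mathbf{X}^{\tau})E(g'(\mathbf{w}^{\tau}\mathbf{X})) - \lambda\mathbf{I}_r = (E(g'(\mathbf{w}^{\tau}\mathbf{X})) - \lambda)\mathbf{I}_r, \tag{56}$$
where we used the fact that $\mathbf{X}$ has been sphered. Substituting (54) and (56) into (53), the
iteration reduces to
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{E(\mathbf{X}g(\mathbf{w}^{\tau}\mathbf{X})) - \lambda\mathbf{w}}{E(g'(\mathbf{w}^{\tau}\mathbf{X})) - \lambda}. \tag{57}$$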
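If we set $\mathbf{E}_1 = E(\mathbf{X}g(\mathbf{w}_{k-1}^{\tau}\mathbf{X}))$ and $E_2 = E(g'(\mathbf{w}_{k-1}^{\tau}\mathbf{X}))$, then (57) can be written as $\mathbf{w}_k = \mathbf{w}_{k-1} - (\mathbf{E}_1 - \lambda\mathbf{w}_{k-1})/(E_2 - \lambda)$ for the kth iteration. Multiplying both sides by $\lambda - E_2$
yields $\mathbf{w}_k(\lambda - E_2) = \mathbf{E}_1 - \mathbf{w}_{k-1}E_2$. Because we divide $\mathbf{w}$ by its norm $\|\mathbf{w}\|$ at each step of
the iterative procedure, the factor $\lambda - E_2$ can be ignored. The iteration (57) is, therefore,
equivalent to
$$\mathbf{w} \leftarrow E(\mathbf{X}g(\mathbf{w}^{\tau}\mathbf{X})) - \mathbf{w}E(g'(\mathbf{w}^{\tau}\mathbf{X})). \tag{58}$$
Table 1. Nonquadratic density functions and their first and second derivatives to be used as input to the FastICA algorithm. Note that for the logcosh density, $1 \le \alpha \le 2$.
density   G(y)                                   g(y) = G′(y)       g′(y) = G′′(y)
logcosh   $\frac{1}{\alpha}\log\cosh(\alpha y)$   $\tanh(\alpha y)$   $\alpha(1 - \tanh^2(\alpha y))$
exp       $-e^{-y^2/2}$                          $ye^{-y^2/2}$      $(1 - y^2)e^{-y^2/2}$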
For the logcosh and exp densities, the functions g and g′ are given in Table 1. Substituting
for g and g′ in (58) for either the logcosh or exp density as appropriate yields the FastICA
algorithm, which is given in Table 2.
The values of $\mathbf{w}$ can change substantially from iteration to iteration; this is because the
ICA model cannot determine the sign of $\mathbf{w}$, so that $-\mathbf{w}$ and $\mathbf{w}$ become equivalent and define
the same direction. In light of this comment, “convergence” of the FastICA algorithm is
taken to have a different meaning than usual, and is taken here to mean that successive
iterative values of $\mathbf{w}$ (i.e., $\mathbf{w}_{k-1}$ and $\mathbf{w}_k$ for some k) are oriented in the same direction up to sign (i.e.,
the absolute value of $\mathbf{w}_k^{\tau}\mathbf{w}_{k-1}$ is very close to 1).
4.5.2 Extracting Multiple Independent Source Components. The FastICA package (Hurri,
Gavert, Sarela, and Hyvarinen, 1998) includes two different ways of extracting more than one
independent source component. Both methods (termed “deflation” and “parallel” methods)
repeatedly call the single component extraction algorithm of Table 2. Essentially, at each
step in the algorithmic cycle:
deflation: the single component routine finds a new component, that new component is
orthogonalized using the Gram-Schmidt method with respect to all previously-found
components, and then the resulting new component is normalized.
parallel: the single component routine is carried out in parallel for each independent component to be extracted, and then a symmetric orthogonalization is carried out on all
components simultaneously.
The deflation method extracts independent components sequentially one-at-a-time, while
the parallel method extracts all the independent components at the same time. Both
algorithms are listed in Table 3.
Table 2. FastICA algorithm for determining a single source component.
1. Center the data to make the mean zero, and then whiten the result to give $\mathbf{X}$.
2. Choose an initial version of the r-vector $\mathbf{w}$ with unit norm.
3. Choose G to be any nonquadratic density with first and second partial derivatives g and g′, respectively. If the choice is either the logcosh or exp density, g and g′ are given in the text.
4. Let $\mathbf{w} \leftarrow E(\mathbf{X}g(\mathbf{w}^{\tau}\mathbf{X})) - \mathbf{w}E(g'(\mathbf{w}^{\tau}\mathbf{X}))$. In practice, the expectations are estimated using sample averages.
5. Let $\mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$.
6. Iterate between steps 4 and 5. Stop when convergence is attained.
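The single-component iteration of Table 2 can be sketched in a few lines. This is an illustrative sketch with the logcosh choice of Table 1, not the code of the FastICA packages; the function names, the eigenvalue-based whitening, and the stopping tolerance are assumptions made for the example.

```python
import numpy as np


def whiten(X):
    """Center the (n, r) data matrix and sphere it (sample covariance = I)."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
    return Xc @ evecs @ np.diag(evals ** -0.5) @ evecs.T


def fastica_one_unit(Z, alpha=1.0, max_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA on whitened data Z with the logcosh contrast."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])
    w /= np.linalg.norm(w)                       # step 2: unit-norm start
    for _ in range(max_iter):
        y = Z @ w
        g = np.tanh(alpha * y)                   # g(y) from Table 1
        g_prime = alpha * (1.0 - g**2)           # g'(y) from Table 1
        # step 4: update (58), expectations replaced by sample averages
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)           # step 5
        converged = abs(w_new @ w) > 1.0 - tol   # same direction up to sign
        w = w_new
        if converged:
            break
    return w
```

Applied to a whitened two-channel mixture of a uniform and a Laplace source, `Z @ w` should be highly correlated (in absolute value) with one of the two sources; which one is found depends on the starting vector.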
4.6 ML Estimation
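4.6.1 The EM Algorithm. Consider two vector-valued random variables $\mathbf{X}$ and $\mathbf{S}$, where
we assume that $\mathbf{X}$ is observed while $\mathbf{S}$ is latent. For $\mathbf{x} \in \mathbb{R}^r$ and $\mathbf{s} \in \mathbb{R}^m$, let the probability
density function of $\mathbf{X}$ be given by $p_{\theta}(\mathbf{x})$ with model parameters θ, and let the prior density of
$\mathbf{S}$ be given by $q_{\eta}(\mathbf{s})$ with variational parameters η. The posterior density of $\mathbf{S}$ given $\mathbf{X} = \mathbf{x}$
is given by
$$p_{\theta}(\mathbf{s}|\mathbf{x}) = \frac{p_{\theta}(\mathbf{x}, \mathbf{s})}{p_{\theta}(\mathbf{x})}, \tag{59}$$
where $p_{\theta}(\mathbf{x}, \mathbf{s})$ is the joint density of $\mathbf{X}$ and $\mathbf{S}$. Taking logarithms of (59) yields the log-likelihood,
$$L(\theta|\mathbf{x}) \equiv \log p_{\theta}(\mathbf{x}) = \log p_{\theta}(\mathbf{x}, \mathbf{s}) - \log p_{\theta}(\mathbf{s}|\mathbf{x}). \tag{60}$$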
Table 3. Two FastICA algorithms for extracting multiple independent source components.
Deflation algorithm
1. Center the data to make its mean zero, and then whiten the result to give $\mathbf{X}$.
2. Decide on the number, m, of independent components to be extracted.
3. For k = 1, 2, . . . , m,
• Initialize (e.g., randomly) the r-vector $\mathbf{w}_k$ to have unit norm.
• Let $\mathbf{w}_k \leftarrow E(\mathbf{X}g(\mathbf{w}_k^{\tau}\mathbf{X})) - \mathbf{w}_kE(g'(\mathbf{w}_k^{\tau}\mathbf{X}))$ be the FastICA single-component update for $\mathbf{w}_k$, where g and g′ are given in Table 1. In practice, the expectations are estimated using sample averages.
• Use the Gram-Schmidt process to orthogonalize $\mathbf{w}_k$ with respect to the previously chosen $\mathbf{w}_1, \ldots, \mathbf{w}_{k-1}$:
$$\mathbf{w}_k \leftarrow \mathbf{w}_k - \sum_{j=1}^{k-1}(\mathbf{w}_k^{\tau}\mathbf{w}_j)\mathbf{w}_j.$$
• Let $\mathbf{w}_k \leftarrow \mathbf{w}_k/\|\mathbf{w}_k\|$.
• Iterate $\mathbf{w}_k$ until convergence.
4. Set k ← k + 1. If k ≤ m, return to step 3.
Parallel algorithm
1. Center the data to make its mean zero, and then whiten the result to give $\mathbf{X}$.
2. Decide on the number, m, of independent components to be extracted.
3. Initialize (e.g., randomly) the r-vectors $\mathbf{w}_1, \ldots, \mathbf{w}_m$, each to have unit norm. Let $\mathbf{W} = (\mathbf{w}_1, \cdots, \mathbf{w}_m)^{\tau}$.
4. Carry out a symmetric orthogonalization of $\mathbf{W}$ by $\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^{\tau})^{-1/2}\mathbf{W}$.
5. For each k = 1, 2, . . . , m, let $\mathbf{w}_k \leftarrow E(\mathbf{X}g(\mathbf{w}_k^{\tau}\mathbf{X})) - \mathbf{w}_kE(g'(\mathbf{w}_k^{\tau}\mathbf{X}))$ be the FastICA single-component update for $\mathbf{w}_k$, where g and g′ are given in Table 1. In practice, the expectations are estimated using sample averages.
6. Carry out another symmetric orthogonalization of $\mathbf{W}$.
7. If convergence has not occurred, return to step 5.
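The parallel algorithm of Table 3 can be sketched as follows. This is a hedged illustration rather than the package code: it assumes the data have already been whitened, uses the logcosh functions of Table 1, and computes $(\mathbf{W}\mathbf{W}^{\tau})^{-1/2}$ from an eigendecomposition; the function names and the stopping rule are hypothetical. The deflation variant would replace the symmetric step by the Gram-Schmidt update shown in the table.

```python
import numpy as np


def sym_orthogonalize(W):
    """Symmetric orthogonalization W <- (W W^T)^(-1/2) W."""
    evals, evecs = np.linalg.eigh(W @ W.T)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T @ W


def fastica_parallel(Z, m, alpha=1.0, max_iter=200, tol=1e-8, seed=0):
    """Parallel FastICA on whitened (n, r) data Z; returns the (m, r) matrix W."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    W = sym_orthogonalize(rng.standard_normal((m, Z.shape[1])))  # steps 3-4
    for _ in range(max_iter):
        S = Z @ W.T                                  # current components
        g = np.tanh(alpha * S)
        g_prime = alpha * (1.0 - g**2)
        # step 5: row k is E(X g(w_k' X)) - w_k E(g'(w_k' X))
        W_new = g.T @ Z / n - g_prime.mean(axis=0)[:, None] * W
        W_new = sym_orthogonalize(W_new)             # step 6
        # rows unchanged up to sign means convergence
        change = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if change < tol:
            break
    return W
```

After the symmetric step the rows of `W` are orthonormal by construction, so `W @ W.T` equals the identity up to rounding error.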
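The function $L(\theta|\mathbf{x})$ is to be maximized over the parameters θ. The expectation of $L(\theta|\mathbf{x})$
in (60) with respect to the prior density $q_{\eta}(\mathbf{s})$ of $\mathbf{S}$ is given by
$$
\begin{aligned}
L(\theta|\mathbf{x}) = \int L(\theta|\mathbf{x})q_{\eta}(\mathbf{s})\,d\mathbf{s} &= \int q_{\eta}(\mathbf{s})\log p_{\theta}(\mathbf{x}, \mathbf{s})\,d\mathbf{s} - \int q_{\eta}(\mathbf{s})\log p_{\theta}(\mathbf{s}|\mathbf{x})\,d\mathbf{s} \\
&= \int q_{\eta}(\mathbf{s})\log p_{\theta}(\mathbf{x}, \mathbf{s})\,d\mathbf{s} + \int q_{\eta}(\mathbf{s})\log\left[\frac{q_{\eta}(\mathbf{s})}{p_{\theta}(\mathbf{s}|\mathbf{x})q_{\eta}(\mathbf{s})}\right]d\mathbf{s} \\
&= \int q_{\eta}(\mathbf{s})\log\left[\frac{p_{\theta}(\mathbf{x}, \mathbf{s})}{q_{\eta}(\mathbf{s})}\right]d\mathbf{s} + \int q_{\eta}(\mathbf{s})\log\left[\frac{q_{\eta}(\mathbf{s})}{p_{\theta}(\mathbf{s}|\mathbf{x})}\right]d\mathbf{s} \\
&= V(\mathbf{x}|\theta, \eta) + KL(q_{\eta}\|p_{\theta}),
\end{aligned} \tag{61}
$$
where
$$V(\mathbf{x}|\theta, \eta) = \int q_{\eta}(\mathbf{s})\log p_{\theta}(\mathbf{x}, \mathbf{s})\,d\mathbf{s} - \int q_{\eta}(\mathbf{s})\log q_{\eta}(\mathbf{s})\,d\mathbf{s} \tag{62}$$
is the difference between the expected energy under $q_{\eta}$ and the entropy of $q_{\eta}$ (which does not
depend upon θ), and $KL(q\|p)$ is the Kullback-Leibler divergence between the prior density
$q_{\eta}(\mathbf{s})$ and the posterior density $p_{\theta}(\mathbf{s}|\mathbf{x})$. The negative of V is also known by those in statistical
physics as (variational) free energy. Note that
$$
\begin{aligned}
KL(q_{\eta}\|p_{\theta}) &= \int q_{\eta}(\mathbf{s})\log\left[\frac{q_{\eta}(\mathbf{s})}{p_{\theta}(\mathbf{s}|\mathbf{x})}\right]d\mathbf{s} \\
&= E_{\eta}\left\{-\log\left[\frac{p_{\theta}(\mathbf{s}|\mathbf{x})}{q_{\eta}(\mathbf{s})}\right]\right\} \\
&\ge -\log\left\{E_{\eta}\left[\frac{p_{\theta}(\mathbf{s}|\mathbf{x})}{q_{\eta}(\mathbf{s})}\right]\right\} \\
&= -\log\left\{\int p_{\theta}(\mathbf{s}|\mathbf{x})\,d\mathbf{s}\right\} \\
&= 0,
\end{aligned} \tag{63}
$$
where we used Jensen’s inequality $E(f(x)) \ge f(E(x))$ for the convex function $f(x) = -\log(x)$
(Loeve, 1963, Section 9.3e), and $E_{\eta}$ indicates expectation taken over the density $q_{\eta}$. Thus,
$$KL(q_{\eta}\|p_{\theta}) \ge 0, \tag{64}$$
so that
$$L(\theta|\mathbf{x}) \ge V(\mathbf{x}|\theta, \eta), \tag{65}$$
with equality if $q_{\eta}(\mathbf{s}) = p_{\theta}(\mathbf{s}|\mathbf{x})$, in which case, from (61), the bound (65) becomes the equality
$L(\theta|\mathbf{x}) = V(\mathbf{x}|\theta, \eta)$. To maximize the log-likelihood, we use the EM algorithm given in Table
4. The value of $L(\theta|\mathbf{x})$ does not decrease from one iteration to the next. The main drawback of the EM
algorithm is that it has a tendency to get captured at local maxima of the likelihood surface.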
Table 4. EM algorithm for maximum likelihood ICA.
1. Set the model parameter estimate, θ, at an initial value $\theta_0$.
2. For k = 1, 2, . . . , iterate between the following two steps:
• E-Step: Fix the model parameter estimate at $\theta_{k-1}$. Update the variational parameter estimate η by maximizing $V(\mathbf{x}|\theta_{k-1}, \eta)$ with respect to η:
$$\eta_k \leftarrow \arg\max_{\eta} V(\mathbf{x}|\theta_{k-1}, \eta).$$
The maximum occurs when $q_{\eta_k}(\mathbf{s}) = p_{\theta_{k-1}}(\mathbf{s}|\mathbf{x})$, at which point $L(\theta_{k-1}|\mathbf{x}) = V(\mathbf{x}|\theta_{k-1}, \eta_k)$.
• M-Step: Fix the variational parameter estimate at $\eta_k$. Update the model parameter estimate θ by maximizing $V(\mathbf{x}|\theta, \eta_k)$ with respect to θ:
$$\theta_k \leftarrow \arg\max_{\theta} V(\mathbf{x}|\theta, \eta_k) = \arg\max_{\theta}\int q_{\eta_k}(\mathbf{s})\log p_{\theta}(\mathbf{x}, \mathbf{s})\,d\mathbf{s}.$$
3. Stop when convergence is attained.
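The decomposition (61) and the bound (65) can be verified numerically in a toy setting. The sketch below assumes, purely for illustration, a latent variable with finitely many states, so that the integrals over s become sums; `joint` holds the values $p_{\theta}(\mathbf{x}, \mathbf{s})$ for one fixed x, and the function name is hypothetical.

```python
import numpy as np


def decompose_loglik(joint, q):
    """Return (L, V, KL) of (61) for a discrete latent variable.

    joint[s] = p_theta(x, s) for a fixed observation x,
    q[s]     = q_eta(s), any strictly positive probability vector.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    marginal = joint.sum()                    # p_theta(x)
    posterior = joint / marginal              # p_theta(s | x), equation (59)
    loglik = np.log(marginal)                 # L(theta | x), equation (60)
    V = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # equation (62)
    KL = np.sum(q * np.log(q / posterior))    # Kullback-Leibler divergence
    return loglik, V, KL
```

For any admissible q the three returned values satisfy L = V + KL with KL ≥ 0, and choosing q equal to the posterior (the E-step of Table 4) makes KL vanish so that V attains L.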
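4.6.2 Square Mixing and the FastICA Algorithm. If the density of the m-vector $\mathbf{S} = (S_1, \cdots, S_m)^{\tau}$ is $q_{\mathbf{S}}(\mathbf{s})$, then the density of the linear transformation $\mathbf{X} = \mathbf{A}\mathbf{S}$, where $\mathbf{A}$ is
square and nonsingular, is $p_{\mathbf{X}}(\mathbf{x}) = |\det(\mathbf{W})|q_{\mathbf{S}}(\mathbf{s})$, where $\mathbf{W} = \mathbf{A}^{-1}$ and $\mathbf{s} = \mathbf{W}\mathbf{x}$. Statistical independence of the sources implies that the joint density, $q_{\mathbf{S}}(\mathbf{s})$, can be written as a product of
its m component source densities, $q_{\mathbf{S}}(\mathbf{s}) = \prod_{j=1}^{m} q_{S_j}(s_j)$, where $q_{S_j}(s_j)$ is the density of $S_j$.
Hence, the joint density of $\mathbf{X}$ is
$$p_{\mathbf{X}}(\mathbf{x}) = |\det(\mathbf{W})|\prod_{j=1}^{m} q_{S_j}(\mathbf{w}_j^{\tau}\mathbf{x}), \tag{66}$$
where $\mathbf{w}_j^{\tau}$ is the jth row of $\mathbf{W}$. Now, suppose we are given n i.i.d. observations, $\mathbf{x}_1, \ldots, \mathbf{x}_n$,
on $\mathbf{X}$. Then, the log-likelihood function (divided by n) is
$$L(\mathbf{W}|\{\mathbf{x}_i\}) = \log|\det(\mathbf{W})| + E\Big\{\sum_{j=1}^{m}\log q_{S_j}(\mathbf{w}_j^{\tau}\mathbf{x}_i)\Big\}, \tag{67}$$
where “E” represents sample average over the n observations. There are several ways of
maximizing the log-likelihood function (67), including the following FastICA-type algorithm.
The derivative of $L(\mathbf{W})$ with respect to $\mathbf{W}$ is given by
$$\frac{\partial L(\mathbf{W})}{\partial\mathbf{W}} = (\mathbf{W}^{\tau})^{-1} + E(\mathbf{g}(\mathbf{W}\mathbf{X})\mathbf{X}^{\tau}), \tag{68}$$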
where $\mathbf{g}(\mathbf{S}) = (g_1(S_1), \cdots, g_m(S_m))^{\tau}$ and $g_j = (\log q_{S_j})' = q'_{S_j}/q_{S_j}$. This suggests the following gradient-ascent update based on (68):
$$\Delta\mathbf{W} \propto (\mathbf{W}^{\tau})^{-1} + E(\mathbf{g}(\mathbf{W}\mathbf{X})\mathbf{X}^{\tau}), \tag{69}$$
where $\Delta\mathbf{W}$ is the difference between successive iterations of $\mathbf{W}$. A stochastic version of
(69) was introduced by Bell and Sejnowski (1995), who derived it using different principles.
We eliminate the matrix inversion of $\mathbf{W}^{\tau}$ at each iteration step, which tends to slow down
this algorithm, by postmultiplying both sides of (69) by $\mathbf{W}^{\tau}\mathbf{W}$. This gives us a simple
formulation of the ML algorithm:
$$\mathbf{W} \leftarrow \mathbf{W} + \mu[\mathbf{I}_m + E(\mathbf{g}(\mathbf{S})\mathbf{S}^{\tau})]\mathbf{W}, \tag{70}$$
where $\mathbf{S} = \mathbf{W}\mathbf{X}$ and µ is the learning rate. Because this algorithm converges if $E(\mathbf{g}(\mathbf{S})\mathbf{S}^{\tau}) = -\mathbf{I}_m$ (so that the bracketed term in (70) vanishes), this condition implies that, for $i \ne j$, $S_i$ is uncorrelated with $g_j(S_j)$.
For a given value of m, let $\mathbf{W} = (\mathbf{w}_1, \cdots, \mathbf{w}_m)^{\tau}$. Step 5 in the parallel FastICA algorithm
in Table 3 can be written in matrix form as follows:
$$\mathbf{W} \leftarrow \mathbf{W} + \mathrm{diag}\{\alpha_i\}[\mathrm{diag}\{\lambda_i\} - E(\mathbf{g}(\mathbf{S})\mathbf{S}^{\tau})]\mathbf{W}, \tag{71}$$
where $\mathbf{S} = (S_1, \cdots, S_m)^{\tau}$, $S_i = \mathbf{w}_i^{\tau}\mathbf{X}$, $\lambda_i = E(S_ig(S_i))$, and $\alpha_i = 1/(E(g'(S_i)) - \lambda_i)$. The
second term on the right-hand side of (71) can be rearranged to give
$$\mathbf{W} \leftarrow \mathbf{W} + \mathrm{diag}\{\alpha_i\lambda_i\}[\mathbf{I}_m - \mathrm{diag}\{\lambda_i^{-1}\}E(\mathbf{g}(\mathbf{S})\mathbf{S}^{\tau})]\mathbf{W}. \tag{72}$$
Hyvarinen (1999) recognized that because the ML algorithm (70) is just a special case of the
FastICA algorithm (72), the FastICA algorithm as given in Table 5 can be interpreted as
maximizing the likelihood (67), thereby directly obtaining the ML estimate of $\mathbf{W}$. The scalar
learning rate µ has now become a more flexible part of the iterative process. Furthermore,
it turns out through simulation studies that careful choice of $\{\alpha_i\}$ and $\{\lambda_i\}$ can speed up
convergence of the FastICA algorithm to be 10–100 times faster than the gradient approach
in deriving ML estimates.
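The gradient form (70) of the ML algorithm can be sketched as below. The source density is an assumption made for the example: each source is modelled by $q(s) = 1/(\pi\cosh s)$, for which $g(s) = (\log q)'(s) = -\tanh(s)$; the learning rate, the iteration count, and the function names are likewise illustrative rather than taken from any package.

```python
import numpy as np


def avg_loglik(W, X):
    """Log-likelihood (67) per observation under q(s) = 1 / (pi cosh s)."""
    S = X @ W.T
    log_q = -np.log(np.pi) - np.log(np.cosh(S))
    return np.log(abs(np.linalg.det(W))) + log_q.sum(axis=1).mean()


def ml_ica(X, mu=0.1, n_iter=500):
    """Batch version of update (70) on (n, m) data X, started from W = I."""
    n, m = X.shape
    W = np.eye(m)
    for _ in range(n_iter):
        S = X @ W.T                    # current source estimates S = W X
        g = -np.tanh(S)                # g_j = (log q_Sj)' for the assumed density
        # E(g(S) S^T) is estimated by the sample average g^T S / n
        W = W + mu * (np.eye(m) + g.T @ S / n) @ W
    return W
```

On a mixture of two Laplace sources, `W @ A` should end up close to a scaled permutation matrix, which is the usual ICA indeterminacy; no matrix inverse is needed inside the loop.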
Table 5. FastICA algorithm for obtaining maximum likelihood estimates.
1. Center the data to make its mean zero, and then whiten the result to give $\mathbf{X}$.
2. Decide on the number, m, of independent components to be extracted.
3. Randomly initialize a separating matrix $\mathbf{W}$.
4. Compute $\mathbf{S} = \mathbf{W}\mathbf{X}$.
5. Compute $\lambda_i = E(S_ig(S_i))$, $\alpha_i = 1/(E(g'(S_i)) - \lambda_i)$, i = 1, 2, . . . , m.
6. Update $\mathbf{W}$ by
$$\mathbf{W} \leftarrow \mathbf{W} + \mathrm{diag}\{\alpha_i\}[\mathrm{diag}\{\lambda_i\} - E(\mathbf{g}(\mathbf{S})\mathbf{S}^{\tau})]\mathbf{W}.$$
7. Carry out a symmetric orthogonalization of $\mathbf{W}$ by
$$\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^{\tau})^{-1/2}\mathbf{W}.$$
8. If convergence has not occurred, return to step 4.
5. LINEAR MIXING: II. NOISY ICA
The linear mixing version of noisy ICA,
$$\mathbf{X} = \mathbf{A}\mathbf{S} + \mathbf{e}, \tag{73}$$
where $\mathbf{A}$ is a full-rank (r × m) mixing matrix with unknown coefficients, has much in common
with factor analysis (Lawley and Maxwell, 1971; Harman, 1976). If we assume that the noise
component $\mathbf{e}$ has zero mean, a diagonal (r × r) covariance matrix, $\mathrm{cov}(\mathbf{e}) = \Psi$, with positive
diagonal entries, and that $\mathbf{S}$ and $\mathbf{e}$ are uncorrelated ($E(\mathbf{S}\mathbf{e}^{\tau}) = \mathbf{0}$), then (73) reduces to the
classical common factor analysis model (FA), where the sources are called factors. For the
model (73), µ = 0 and $\Sigma_{XX} = \mathbf{A}\mathbf{A}^{\tau} + \Psi$. The BSS (and ICA) problem for model (73) is to
estimate $\mathbf{A}$ and recover $\mathbf{S}$.
5.0.3 Principal Components FA. Without making any distributional assumption (e.g.,
Gaussian) for the sources in (73), we can determine $\mathbf{A}$ using a least-squares formulation. In
fact, premultiplying (73) by the Moore-Penrose generalized inverse, $\mathbf{B} = (\mathbf{A}^{\tau}\mathbf{A})^{-1}\mathbf{A}^{\tau}$, of $\mathbf{A}$,
and then substituting the result in terms of $\mathbf{S}$ back into (73), we can re-express the model as
$$\mathbf{X} = \mathbf{C}\mathbf{X} + \mathbf{E}, \tag{74}$$
where $\mathbf{C} = \mathbf{A}\mathbf{B}$ has rank m, $\mathbf{A}$ and $\mathbf{B}$ are full-rank matrices each of rank m, $\mathbf{E} = (\mathbf{I} - \mathbf{C})\mathbf{e}$,
and $\mathbf{X}$ and $\mathbf{E}$ both have mean zero. The model (74) is the multivariate reduced-rank regression
model corresponding to principal components analysis (Izenman, 1975). The least-squares
criterion,
$$E\{(\mathbf{X} - \mathbf{A}\mathbf{B}\mathbf{X})^{\tau}(\mathbf{X} - \mathbf{A}\mathbf{B}\mathbf{X})\} \tag{75}$$
is, therefore, minimized by setting
$$\mathbf{A} = (\mathbf{v}_1, \cdots, \mathbf{v}_m) = \mathbf{B}^{\tau}, \tag{76}$$
where $\mathbf{v}_j$ is the eigenvector corresponding to the jth largest eigenvalue of $\Sigma_{XX}$. The rows of
the matrix $\mathbf{B}$ give the coefficients of the m principal components scores, $\mathbf{v}_j^{\tau}\mathbf{X}$, j = 1, 2, . . . , m,
and the eigenvalues of $\Sigma_{XX}$, which are usually ordered from largest to smallest, measure the
variance (or power) of the m sources.
Because $\mathbf{C} = (\mathbf{A}\mathbf{T})(\mathbf{T}^{-1}\mathbf{B})$ for any nonsingular (m × m)-matrix $\mathbf{T}$, we can only determine
$\mathbf{A}$ (and, hence, also $\mathbf{S}$) up to a rotation. In factor analysis, this is generally referred to as
the problem of factor indeterminacy. Although this solution does not depend on Ψ, an
adjustment to the analysis can be made by considering the matrix $\Sigma_{XX} - \Psi$ in place of
$\Sigma_{XX}$. This approach, usually called the principal factor method, has sufficient computational
defects that it has generally been abandoned in favor of the maximum-likelihood (ML)
method.
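The least-squares solution (76) amounts to an eigendecomposition of the covariance matrix. A minimal sketch, assuming the covariance matrix is available (in practice its sample estimate) and using a hypothetical function name:

```python
import numpy as np


def principal_component_mixing(cov, m):
    """Solution (76): A = (v_1, ..., v_m) and B = A^T from the top m eigenvectors."""
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:m]          # indices of the m largest
    A = evecs[:, order]                          # (r, m) matrix of eigenvectors
    B = A.T                                      # rows give the PC coefficients
    return A, B, evals[order]
```

The product C = AB is then a rank-m projection, and the minimized value of the criterion (75) equals the sum of the r − m discarded eigenvalues. Replacing A by AT and B by T⁻¹B for any nonsingular T leaves C unchanged, which is the indeterminacy described above.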
5.0.4 Maximum-Likelihood FA. The ML method assumes a fully parametric model
in which the m sources in (73) are distributed as multivariate Gaussian, $\mathbf{S} \sim N_m(\mathbf{0}, \mathbf{I}_m)$,
independent of the noise, which is also multivariate Gaussian, $\mathbf{e} \sim N_r(\mathbf{0}, \Psi)$, where Ψ is
diagonal. In some formulations, $\Psi = a^2\mathbf{I}_r$, where a is an unknown constant.
Given n independent observations, $\mathbf{x}_1, \ldots, \mathbf{x}_n$, on $\mathbf{X}$, we compute the sample covariance
matrix $\hat{\Sigma}_{XX}$ as before, which has a Wishart distribution: $n\hat{\Sigma}_{XX} \sim W_r(n, \Sigma_{XX})$. ML
estimators of $\mathbf{A}$ and Ψ are obtained by maximizing the logarithm of the likelihood function,
$$\log_e L = -\frac{n}{2}\log_e|\mathbf{A}\mathbf{A}^{\tau} + \Psi| - \frac{n}{2}\mathrm{tr}\{\hat{\Sigma}_{XX}(\mathbf{A}\mathbf{A}^{\tau} + \Psi)^{-1}\}, \tag{77}$$
where we have used the Wishart distribution of $n\hat{\Sigma}_{XX}$ and ignored constants and terms which do not involve $\mathbf{A}$ or Ψ.
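We apply the EM algorithm to maximize $\log_e L$ with respect to $\mathbf{A}$ and Ψ (Rubin and
Thayer, 1982). See Table 6. The algorithm treats the unobservable source scores $\{\mathbf{s}_i\}$ as
if they were missing data. If the $\{\mathbf{s}_i\}$ were actually observed, the complete-data likelihood
would be given by the joint distribution of the $\{\mathbf{s}_i\}$ and the $\{\mathbf{e}_i = \mathbf{x}_i - \mathbf{A}\mathbf{s}_i\}$,
$$
\begin{aligned}
\mathrm{Lik} &= \prod_{i=1}^{n}\left\{(2\pi)^{-r/2}|\Psi|^{-1/2}e^{-\frac{1}{2}\mathbf{e}_i^{\tau}\Psi^{-1}\mathbf{e}_i}\,(2\pi)^{-m/2}e^{-\frac{1}{2}\mathbf{s}_i^{\tau}\mathbf{s}_i}\right\} \\
&= \left[(2\pi)^r\prod_{j=1}^{r}\psi_{jj}\right]^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{r}\frac{(x_{ij} - \mathbf{A}_j\mathbf{s}_i)^2}{\psi_{jj}}\right\} \times \{(2\pi)^m\}^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\mathbf{s}_i^{\tau}\mathbf{s}_i\right\},
\end{aligned} \tag{78}
$$
where $x_{ij}$ is the jth component of $\mathbf{x}_i$, $\mathbf{A}_j$ is the jth row of $\mathbf{A}$, and $\psi_{jj}$ is the jth diagonal
element of the diagonal matrix Ψ. Given the observed data $\{x_{ij}\}$ and the current estimated
values of the parameters, the conditional expectation of (78), taken over the distribution
of the missing data $\{\mathbf{S}_i\}$, is equal to $e^{\log_e L}$.
Table 6. EM algorithm for maximum likelihood factor analysis.
1. Let $\mathbf{A}_0$ and $\Psi_0$ be initial guesses for the parameter matrices $\mathbf{A}$ and Ψ, respectively.
2. For k = 1, 2, . . . , iterate between the following two steps:
• E-Step: Compute
$$\mathbf{C}_{XX} = n^{-1}\sum_{i=1}^{n}\mathbf{X}_i\mathbf{X}_i^{\tau}$$
$$\mathbf{C}_{XS}^{(k-1)} = \mathbf{C}_{XX}\delta_{k-1}^{\tau}$$
$$\mathbf{C}_{SS}^{(k-1)} = \delta_{k-1}\mathbf{C}_{XX}\delta_{k-1}^{\tau} + \Delta_{k-1}$$
where
$$\delta_{k-1} = \mathbf{A}_{k-1}^{\tau}(\mathbf{A}_{k-1}\mathbf{A}_{k-1}^{\tau} + \Psi_{k-1})^{-1}$$
$$\Delta_{k-1} = \mathbf{I}_m - \delta_{k-1}\mathbf{A}_{k-1}.$$
• M-Step: Update the parameter estimates,
$$\mathbf{A}_k \leftarrow \mathbf{C}_{XS}^{(k-1)}(\mathbf{C}_{SS}^{(k-1)})^{-1}$$
$$\Psi_k \leftarrow \mathrm{diag}\{\mathbf{C}_{XX} - \mathbf{C}_{XS}^{(k-1)}(\mathbf{C}_{SS}^{(k-1)})^{-1}\mathbf{C}_{XS}^{(k-1)\tau}\}.$$
3. Stop when convergence has been attained.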
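The logarithm of (78) is, ignoring constants,
$$\log_e(\mathrm{Lik}) = -\frac{n}{2}\sum_{j=1}^{r}\log_e(\psi_{jj}) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{r}\frac{(x_{ij} - \mathbf{A}_j\mathbf{s}_i)^2}{\psi_{jj}} - \frac{1}{2}\sum_{i=1}^{n}\mathbf{s}_i^{\tau}\mathbf{s}_i. \tag{79}$$
The E-step of the EM algorithm entails finding the conditional expectation of (79), given
the observed data $\{\mathbf{x}_i\}$ and the current values of the parameters $\mathbf{A}$ and Ψ. Because the
joint distribution of $\mathbf{x}_i$ and $\mathbf{s}_i$ given $\mathbf{A}$ and Ψ, is (r + m)-variate Gaussian, the conditional
distribution of $\mathbf{s}_i$ given $\mathbf{x}_i$ is
$$(\mathbf{s}_i|\mathbf{x}_i, \mathbf{A}, \Psi) \sim N_m(\delta\mathbf{x}_i, \Delta), \tag{80}$$
where
$$\delta = \mathbf{A}^{\tau}(\mathbf{A}\mathbf{A}^{\tau} + \Psi)^{-1} \tag{81}$$
$$\Delta = \mathbf{I}_m - \mathbf{A}^{\tau}(\mathbf{A}\mathbf{A}^{\tau} + \Psi)^{-1}\mathbf{A}. \tag{82}$$
To find the expectation of (79), we need to find the expectations of the following sufficient
statistics,
$$\mathbf{C}_{XX} = n^{-1}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\tau}, \quad \mathbf{C}_{XS} = n^{-1}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{s}_i^{\tau}, \quad \mathbf{C}_{SS} = n^{-1}\sum_{i=1}^{n}\mathbf{s}_i\mathbf{s}_i^{\tau}.$$
Given the data $\{\mathbf{x}_i\}$ and parameters $\mathbf{A}$ and Ψ, the expectations are
$$\mathbf{C}^*_{XX} = E(\mathbf{C}_{XX}|\{\mathbf{x}_i\}, \mathbf{A}, \Psi) = \mathbf{C}_{XX} \tag{83}$$
$$\mathbf{C}^*_{XS} = E(\mathbf{C}_{XS}|\{\mathbf{x}_i\}, \mathbf{A}, \Psi) = \mathbf{C}_{XX}\delta^{\tau} \tag{84}$$
$$\mathbf{C}^*_{SS} = E(\mathbf{C}_{SS}|\{\mathbf{x}_i\}, \mathbf{A}, \Psi) = \delta\mathbf{C}_{XX}\delta^{\tau} + \Delta. \tag{85}$$
Equations (81) through (85) define the E-step based upon the observed data $\{\mathbf{x}_i\}$ and the
current values of the parameter estimates $\hat{\mathbf{A}}$ and $\hat{\Psi}$.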
The M-step provides the updated versions of the ML estimates by using the regression
estimates,
$$\hat{\mathbf{A}} = \mathbf{C}^*_{XS}\mathbf{C}^{*-1}_{SS} \tag{86}$$
$$\hat{\Psi} = \mathrm{diag}\{\mathbf{C}^*_{XX} - \mathbf{C}^*_{XS}\mathbf{C}^{*-1}_{SS}\mathbf{C}^{*\tau}_{XS}\}. \tag{87}$$
The current estimates (86) and (87) are substituted for $\mathbf{A}$ and Ψ, respectively, in (81) and
(82) to get updated estimates of δ and ∆, which are then used to recompute $\mathbf{C}^*_{XS}$ and $\mathbf{C}^*_{SS}$,
and get new values of $\mathbf{A}$ and Ψ. The method is iterated until we arrive at convergence.
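The E- and M-steps above translate almost line for line into matrix code. The following is a sketch under stated assumptions, not the Rubin and Thayer program: the data matrix is assumed to be centered, the starting values (a random loading matrix and the diagonal of the sample covariance) are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np


def fa_loglik(Cxx, A, psi, n):
    """Log-likelihood (77), up to constants; psi holds the diagonal of Psi."""
    Sigma = A @ A.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, Cxx)))


def fa_em(X, m, n_iter=100, seed=0):
    """EM iterations (81)-(87) for ML factor analysis on centered (n, r) data."""
    n, r = X.shape
    Cxx = X.T @ X / n
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r, m))          # arbitrary starting loadings
    psi = np.diag(Cxx).copy()                # arbitrary starting noise variances
    history = [fa_loglik(Cxx, A, psi, n)]
    for _ in range(n_iter):
        # E-step
        delta = A.T @ np.linalg.inv(A @ A.T + np.diag(psi))   # (81)
        Delta = np.eye(m) - delta @ A                          # (82)
        Cxs = Cxx @ delta.T                                    # (84)
        Css = delta @ Cxx @ delta.T + Delta                    # (85)
        # M-step
        A = Cxs @ np.linalg.inv(Css)                           # (86)
        psi = np.diag(Cxx - A @ Cxs.T).copy()                  # (87)
        history.append(fa_loglik(Cxx, A, psi, n))
    return A, psi, np.array(history)
```

The recorded values of (77) should be nondecreasing across iterations, which gives a simple check on an implementation. The returned loading matrix is determined only up to rotation.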
MLFA, however, cannot resolve the BSS problem precisely because of these Gaussian
assumptions. Gaussian variables which are mutually uncorrelated are also automatically
independent, and so MLFA only requires that the sources be uncorrelated. Furthermore,
MLFA suffers from a similar ailment as does principal component FA: the likelihood function
is rotationally-invariant in factor space, and so the sources S and the mixing matrix A can
only be defined up to an arbitrary rotation.
5.0.5 Independent Factor Analysis. As an alternative to the FA assumptions for dealing
with the BSS problem, Attias (1999) introduced the technique of independent factor analysis
(IFA) in which the model is still given by (73) with $\mathbf{e} \sim N_r(\mathbf{0}, \Psi)$, Ψ not necessarily diagonal, but now each unobserved source signal $S_j$ is assumed to be independently distributed
according to a non-Gaussian density. In particular, Attias modelled each source density by
an arbitrary mixture of univariate Gaussian (MoG) densities,
$$q_{S_j}(s_j) = \sum_{i=1}^{I_j} w_{ij}\phi_{\eta_{ij}}(s_j), \tag{88}$$
where $\phi_{\eta_{ij}}(s)$ is $N(\mu_{ij}, \sigma^2_{ij})$, $\eta_{ij} = (\mu_{ij}, \sigma^2_{ij})$, and $w_{ij} > 0$ is the mixing proportion attached
to the ith component of the jth source density, $i = 1, 2, \ldots, I_j$, with $\sum_{i=1}^{I_j} w_{ij} = 1$, j =
1, 2, . . . , m (Attias, 1999).
The MoG density (88) can mimic both super-Gaussian and sub-Gaussian densities by
using a large enough set of component densities and is a major reason why it has played
such an important role in ICA modelling. MoG densities became widely used in statistics
after Tukey (1960) showed how useful their representation was for modelling the presence
of outliers and in robustness studies. He considered mixtures consisting of two components
which have the same mean but different variances, and referred to the mixture density
$p(s) = (1 - w)p_{S_1}(s) + wp_{S_2}(s)$, with w small, as a contaminated density. Since then, MoG
densities have been used in many different settings (Titterington, Smith, and Makov, 1985;
McLachlan and Basford, 1988; Everitt and Hand, 1981). The main disadvantage of working
with MoG densities is that the total number of parameters can grow to be very large.
The joint source density, $q_{\mathbf{S}}(\mathbf{s})$, can be written as a product of its m component source
densities,
$$q_{\mathbf{S}}(\mathbf{s}) = \prod_{j=1}^{m} q_{S_j}(s_j) = \prod_{j=1}^{m}\sum_{i=1}^{I_j} w_{ij}\phi_{\eta_{ij}}(s_j) = \sum_{\mathbf{i}} w_{\mathbf{i}}\phi_{\eta}(\mathbf{s}), \tag{89}$$
where $\eta = \{\eta_{ij}\}$, $w_{\mathbf{i}}$ is a product of the $\{w_{ij}\}$, and $\phi_{\eta}(\mathbf{s})$ is a product of the $\{\phi_{\eta_{ij}}(s_j)\}$.
The parameters of this mixture, the mixing matrix $\mathbf{A}$, and the noise covariance matrix
Ψ are estimated using an appropriate EM algorithm.
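The claim that a MoG density of the form (88) can be either super-Gaussian or sub-Gaussian is easy to check from its moments. The sketch below evaluates a univariate MoG density and its excess kurtosis in closed form; the function names and the two example mixtures are illustrative choices, not taken from Attias (1999).

```python
import numpy as np


def mog_density(s, weights, means, sds):
    """Univariate mixture-of-Gaussians density (88) evaluated at the points s."""
    s = np.atleast_1d(np.asarray(s, dtype=float))[:, None]
    w, mu, sd = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    comps = np.exp(-0.5 * ((s - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return comps @ w


def mog_excess_kurtosis(weights, means, sds):
    """Excess kurtosis of the mixture, from the raw Gaussian moments."""
    w, mu, sd = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    m1 = np.sum(w * mu)
    m2 = np.sum(w * (mu**2 + sd**2))
    m3 = np.sum(w * (mu**3 + 3 * mu * sd**2))
    m4 = np.sum(w * (mu**4 + 6 * mu**2 * sd**2 + 3 * sd**4))
    var = m2 - m1**2
    mu4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4   # fourth central moment
    return mu4 / var**2 - 3.0


# Tukey-style contaminated density: same mean, different variances
contaminated = mog_excess_kurtosis([0.9, 0.1], [0.0, 0.0], [1.0, 3.0])
# symmetric bimodal density: different means, same variance
bimodal = mog_excess_kurtosis([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

The contaminated mixture has positive excess kurtosis (super-Gaussian), while the bimodal mixture has negative excess kurtosis (sub-Gaussian).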
6. REFERENCES
Attias, H. (1999), “Independent Factor Analysis,” Neural Computation, 11, 803–852.
Bach, F.R. and Jordan, M.I. (2002), “Kernel Independent Component Analysis,” Journal
of Machine Learning Research, 3, 1–48.
Back, A.D. and Weigend, A.S. (1997), “A First Application of Independent Component
Analysis to Extracting Structure from Stock Returns,” International Journal of Neural
Systems, 8, 473–484.
Bell, A.J. and Sejnowski, T.J. (1995), “An Information-Maximization Approach to Blind
Separation and Blind Deconvolution,” Neural Computation, 7, 1129–1159.
Cardoso, J.-F. and Pham, D.-T. (2001), “Separation of Non-Stationary Sources: Algo-
rithms and Performance,” In Independent Component Analysis: Principles and Prac-
tice, Roberts, S. and Everson, R. (eds.), Cambridge, U.K.: Cambridge University Press,
pp. 158–180.
Cherry, E.C. (1953), “Some Experiments in the Recognition of Speech, With One and Two
Ears,” Journal of the Acoustical Society of America, 25, 975–979.
Cichocki, A. and Amari, S. (2003), Adaptive Blind Signal and Image Processing, New York:
Wiley.
Comon, P. (1994), “Independent Component Analysis, A New Concept?” Signal Processing, 36, 287–314.
Cook, D., Buja, A., and Cabrera, J. (1993), “Projection Pursuit Indexes Based on Or-
thogonal Function Expansions,” Journal of Computational and Graphical Statistics, 2,
225–250.
Cover, T. and Thomas, J. (1991), Elements of Information Theory, Volume 1, New York:
Wiley.
Diaconis, P. and Freedman, D. (1984), “Asymptotics of Graphical Projection Pursuit,”
Annals of Statistics, 12, 793–815.
Donoho, D. (1981), “On Minimum Entropy Deconvolution,” In Applied Time Series Anal-
ysis II, D.A. Finley (ed.), New York: Academic Press, pp. 565–608.
Everitt, B.S. and Hand, D.J. (1981), Finite Mixture Distributions, London: Chapman and
Hall.
Fodor, I.K. and Kamath, C. (2003), “Using Independent Component Analysis to Separate
Signals in Climate Data,” unpublished technical report, Lawrence Livermore National
Laboratories, Livermore, CA.
Friedman, J. (1987), “Exploratory Projection Pursuit,” Journal of the American Statistical
Association, 82, 249–266.
Friedman, J. and Tukey, J. (1974), “A Projection Pursuit Algorithm for Exploratory Data
Analysis,” IEEE Transactions on Computers, Series C, 23, 881–889.
Giannakopoulos, X., Karhunen, J., and Oja, E. (1999), “An Experimental Comparison
of Neural Algorithms for Independent Component Analysis and Blind Separation,”
International Journal of Neural Systems, 9, 99–114.
Girolami, M. (ed.) (2000), Advances in Independent Component Analysis, New York:
Springer-Verlag.
Gray, R.M. (1990), Entropy and Information Theory, New York: Springer.
Green, P.J. (1990), “Bayesian Reconstructions From Emission Tomography Data Using a
Modified EM Algorithm,” IEEE Transactions on Medical Imaging, 16(5), 516–526.
Hall, P. (1989), “On Polynomial-Based Projection Indices for Exploratory Projection Pur-
suit,” Annals of Statistics, 17, 589–605.
Harman, H.H. (1976), Modern Factor Analysis, Third Edition Revised, Chicago: The Uni-
versity of Chicago Press.
Hastie, T. and Tibshirani, R. (2003), “Independent Components Analysis Through Product
Density Estimation,” unpublished manuscript.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, New York: Springer-Verlag.
Herault, J. and Jutten, C. (1986), “Space or Time Processing by Neural Network Mod-
els,” in Proceedings of the AIP Conference: Neural Networks for Computing (ed.:
J.S. Denker), 151, American Institute for Physics.
Huber, P. (1985), “Projection Pursuit,” Annals of Statistics, 13, 435–475.
Hurri, Gavert, Sarela, and Hyvarinen, A. (1998), “The FastICA Package for MATLAB,”
http://isp.imm.dtu.dk/toolbox/
Hyvarinen, A. (1998), “New Approximations of Differential Entropy for Independent Com-
ponent Analysis and Projection Pursuit,” In Advances in Neural Information Process-
ing Systems. 10, 273–279.
Hyvarinen, A. (1999), “The Fixed-Point Algorithm and Maximum Likelihood Estimation
for Independent Component Analysis,” Neural Processing Letters, 10, 1–5.
Hyvarinen, A., Karhunen, J. and Oja, E. (2001), Independent Component Analysis, New
York: Wiley.
Izenman, A.J. (1975), “Reduced-Rank Regression for the Multivariate Linear Model,” Jour-
nal of Multivariate Analysis, 5, 248–264.
Izenman, A.J. (1991), “Recent Developments in Nonparametric Density Estimation,” Jour-
nal of the American Statistical Association, 86, 205–224.
Joe, H. (1989), “Estimation of Entropy and Other Functionals of a Multivariate Density,”
Annals of the Institute of Statistical Mathematics, 41, 683–697.
Jones, M.C. and Sibson, R. (1987), “What is Projection Pursuit?” Journal of the Royal
Statistical Society, Series A, 150, 1–36.
Jutten, C. (2000), “Source Separation: From Dusk Till Dawn,” in Proceedings of the 2nd
International Workshop on Independent Component Analysis and Blind Source Sepa-
ration (ICA 2000), 15–26, Helsinki, Finland.
Kruskal, J.B. (1969), “Toward a Practical Method Which Helps Uncover the Structure
of a Set of Multivariate Observations by Finding the Linear Transformation Which
Optimizes a New ‘Index of Condensation’,” In Statistical Computation (R.C. Milton
and J.A. Nelder, eds.), pp. 427–440, New York: Academic Press.
Kruskal, J.B. (1972), “Linear Transformation of Multivariate Data to Reveal Clustering,”
In Multidimensional Scaling: Theory and Applications in the Behavioural Sciences,
Volume 1 (R.N. Shepard, A.K. Romney, and S.B. Nerlove, eds.), pp. 179–191, London:
Seminar Press.
Lawley, D.N. and Maxwell, A.E. (1971), Factor Analysis as a Statistical Method, Second
Edition, New York: American Elsevier Publishing Company.
Lee, T.-W. (1998), Independent Component Analysis — Theory and Applications, Kluwer.
Loeve, M. (1963), Probability Theory, New York: Van Nostrand.
Marchini, J.L., Heaton, C., and Ripley, B.D. (2003), “The fastICA Package, Version 1.1-3,”
http://www.stats.ox.ac.uk/~marchini/software.html
McKeown, M., Makeig, S., Brown, S., Jung, T.-P., Kindermann, S., Bell, A.J., and
Sejnowski, T. (1998), “Analysis of fMRI Data by Blind Separation Into Independent
Spatial Components,” Human Brain Mapping, 6, 160–188.
McLachlan, G.J. and Basford, K.E. (1988), Mixture Models: Inference and Applications to
Clustering, New York: Dekker.
Nandi, A.K. (ed.) (1999), Blind Estimation Using Higher-Order Statistics, Kluwer.
Parra, L.C. and Spence, C.D. (2001), “Separation of Non-Stationary Natural Signals,” In
Independent Component Analysis: Principles and Practice, Roberts, S. and Everson,
R. (eds.), Cambridge, U.K.: Cambridge University Press, pp. 135–157.
Pechura, M. and Martin, J.B. (eds.) (1991), Mapping the Brain and Its Functions:
Integrating Enabling Technologies into Neuroscience Research, Washington, D.C.: National
Academy Press.
Roberts, S. and Everson, R. (eds.) (2001), Independent Component Analysis: Principles
and Practice, Cambridge, U.K.: Cambridge University Press.
Roweis, S. and Ghahramani, Z. (1999), “A Unifying Review of Linear Gaussian Models,”
Neural Computation, 11, 305–345.
Rubin, D.B. and Thayer, D.T. (1982), “EM Algorithms for ML Factor Analysis,”
Psychometrika, 47, 69–76.
Salerno, E., Bedini, L., Kuruoglu, E., and Tonazzini, A. (2002), “Blind Image Analysis
Helps Research in Cosmology,” ERCIM News, 49, April 2002.
Thisted, R.A. (1988), Elements of Statistical Computing: Numerical Computation, New
York: Chapman and Hall.
Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), Statistical Analysis of Finite
Mixture Distributions, New York: Wiley.
Tukey, J.W. (1960), “A Survey of Sampling From Contaminated Distributions,” In
Contributions to Probability and Statistics (I. Olkin, ed.), Stanford, CA: Stanford University Press.
Tukey, P.A. and Tukey, J.W. (1981), “Graphical Display of Data Sets in 3 or More
Dimensions,” in Interpreting Multivariate Data (V. Barnett, ed.), pp. 187–275, New York:
Wiley.
Venables, W.N. and Ripley, B.D. (2002), Modern Applied Statistics with S, Fourth Edition,
New York: Springer-Verlag.
Weir, I.S. (1997), “Fully Bayesian SPECT Reconstructions,” Journal of the American
Statistical Association, 92, 49–60.
Weir, I.S. and Green, P.J. (1994), “Modelling Data From Single Photon Emission
Computed Tomography,” In K.V. Mardia (ed.), Statistics and Images, 2, 313–338. Carfax,
Abingdon.