Page 1

KERNEL INDEPENDENT COMPONENT ANALYSIS

BY FRANCIS BACH & MICHAEL JORDAN

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003

Presented by Nagesh Adluru

Page 2

Goal of the Paper

To perform Independent Component Analysis (ICA) in a novel way that is more accurate and more robust than existing techniques.

Page 3

Concepts Involved

ICA – Independent Component Analysis
Mutual Information
F – Correlation
RKHS – Reproducing Kernel Hilbert Spaces
CCA – Canonical Correlation Analysis
KICA – Kernel ICA
KGV – Kernel Generalized Variance

Page 4

ICA – Independent Component Analysis

ICA is unsupervised learning.

We have to estimate x given a set of observations of y, under the assumption that the components of x are independent.

So we have to estimate a de-mixing matrix W such that x = Wy.
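To make the setup concrete, here is a small NumPy sketch of the standard noiseless ICA model; the Laplace sources, the specific mixing matrix A, and all variable names are illustrative assumptions rather than anything taken from the paper.

import numpy as np

# Toy illustration of the noiseless ICA model y = A x.
rng = np.random.default_rng(0)
x = rng.laplace(size=(2, 1000))          # two independent (super-Gaussian) sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])               # unknown mixing matrix
y = A @ x                                # the observations we actually see

# ICA estimates a de-mixing matrix W so that x_hat = W @ y has independent
# components; ideally W recovers A^{-1} up to permutation and scaling of rows.
W = np.linalg.inv(A)                     # "oracle" answer, shown only for reference
x_hat = W @ y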

Page 5

ICA – Independent Component Analysis

ICA is semi-parametric: because we do not know anything about the distribution of x, that part of the problem is non-parametric; but we do know that y is a linear combination of the components of x, which gives a parametric part (the mixing matrix).

So the problem is semi-parametric, and kernel methods do well in such situations.

Page 6

ICA – Independent Component Analysis

If we knew the distribution of x, we could fix a parametric model for it and find W using a gradient or fixed-point algorithm.

But in practice we do not know it. So how do we proceed? Since we are looking for independent components, we need to maximize independence, i.e. minimize mutual information.

Page 7

Mutual Information

Mutual information is a quantity that describes the degree of dependence among random variables.

It is smallest, namely zero, exactly when the variables are independent.
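For reference, the standard definition (not specific to this paper) is the Kullback-Leibler divergence between the joint density and the product of the marginals:

I(x_1, \dots, x_m) \;=\; \int p(x_1, \dots, x_m)\,\log \frac{p(x_1, \dots, x_m)}{p(x_1)\cdots p(x_m)}\; dx_1 \cdots dx_m ,

which is nonnegative and equals zero if and only if the variables are independent.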

So it looks promising to explore! Prior work has focused on approximations to this quantity because it is difficult to estimate for real-valued variables from finite samples.

Kernels offer better ways.

Page 8

F – Correlation

F – Correlation is defined as below:
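Written out, following the paper's definition, the F-correlation of two random variables x_1 and x_2 with respect to a function space \mathcal{F} is the maximal correlation over pairs of functions in \mathcal{F}:

\rho_{\mathcal{F}} \;=\; \max_{f_1, f_2 \in \mathcal{F}} \operatorname{corr}\big( f_1(x_1),\, f_2(x_2) \big) \;=\; \max_{f_1, f_2 \in \mathcal{F}} \frac{\operatorname{cov}\big( f_1(x_1), f_2(x_2) \big)}{\operatorname{var}\big( f_1(x_1) \big)^{1/2}\, \operatorname{var}\big( f_2(x_2) \big)^{1/2}} .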

If x1 and x2 are independent then ρ_F is zero, but it is the converse that matters here.

Page 9

F – Correlation

Converse: if ρ_F is zero, then x1 and x2 are independent.

Is that true? It holds if the function space F is very large. But it also holds if F is restricted to the reproducing kernel Hilbert space (RKHS) associated with a Gaussian kernel.

Page 10

F – Correlation

Since the converse holds even when F is restricted to such an RKHS, a mutual-information-like contrast function can be defined that is zero if and only if the two variables are independent.
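Concretely, by analogy with the mutual information of two Gaussian variables with correlation ρ, the two-variable contrast can be written in terms of the F-correlation as

I_{\rho_{\mathcal{F}}}(x_1, x_2) \;=\; -\tfrac{1}{2}\,\log\big(1 - \rho_{\mathcal{F}}^2\big),

which is nonnegative and equals zero exactly when \rho_{\mathcal{F}} = 0, i.e. (by the converse above) exactly when x_1 and x_2 are independent.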

Page 11

RKHS – Reproducing Kernel Hilbert Spaces

Operations using kernels can be treated as operations in a Hilbert space of functions.

The reproducing property of the kernel lets these Hilbert-space operations be carried out with ordinary computations on the data, which is exploitable for computational purposes.

So the correlation between the functions f_1(x_1) and f_2(x_2) can be interpreted as a correlation between projections of the feature maps Φ(x_1) and Φ(x_2), i.e. as the canonical correlation between the Φs.
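The reproducing property referred to here is the standard RKHS identity (K is the kernel, ⟨·,·⟩ the RKHS inner product):

f(x) \;=\; \langle f,\; K(\cdot, x) \rangle_{\mathcal{F}} \quad \text{for all } f \in \mathcal{F},

so evaluating f at a data point, and hence correlations such as corr(f_1(x_1), f_2(x_2)), reduces to inner products that can be computed from the Gram matrix K_{ij} = K(x^{(i)}, x^{(j)}).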

Page 12

CCA – Canonical Correlation Analysis

CCA vs PCA:

PCA maximizes the variance of the projection of a single random vector.

CCA maximizes the correlation between projections of two (or more) random vectors, based on the covariance blocks C_{ij} = cov(x_i, x_j).

Page 13

CCA – Canonical Correlation Analysis

While PCA leads to an eigenvector problem, CCA leads to a generalized eigenvector problem. (Eigenvector problem: Av = λv. Generalized eigenvector problem: Av = λBv.)
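In the two-vector case, with the covariance blocks C_{ij} = cov(x_i, x_j) from the previous slide, this generalized eigenvector problem takes the standard CCA form

\begin{pmatrix} 0 & C_{12} \\ C_{21} & 0 \end{pmatrix} \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \;=\; \rho \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} \end{pmatrix} \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix},

where the largest ρ is the first canonical correlation.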

The CCA can easily be kernelized and also generalized to more than two random vectors.

So the max correlation between variables can be found efficiently, which is very nice.

Page 14

CCA – Canonical Correlation Analysis

Though this kernelization of CCA helps us, the generalization to more than two variables is not an exact measure of mutual independence in terms of the F-correlation.

But that is not a limitation in practice, both because of empirical results and because, in the ICA setting, mutual independence can be characterized through pairwise independence.

Page 15

Kernel ICA

We saw that the F-correlation yields a contrast function whose minimum characterizes independence.

We also saw that it can be calculated using kernelized CCA.

So we now have Kernel ICA, not in the sense that basic ICA is kernelized, but because the contrast function is computed using kernelized CCA.

Page 16

KICA – Kernel ICA Algorithm

Input: the observations y and an initial W.

Procedure: estimate the components x = Wy, form the N×N Gram matrices K_1, ..., K_m (one for each component of the estimated random vector), and minimize the contrast function C(W) obtained from their kernelized canonical correlations. (This is equivalent to generalized CCA, where each of the m vectors is a single-element vector.)
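A minimal NumPy/SciPy sketch of this contrast computation is given below, for intuition only: the Gaussian-kernel width sigma, the regularization constant kappa, and the function names are illustrative assumptions, and the full-size eigenproblem ignores the low-rank speed-ups discussed on the next slide.

import numpy as np
from scipy.linalg import eigh

def centered_gaussian_gram(x, sigma=1.0):
    """N x N centered Gaussian-kernel Gram matrix of a 1-D sample x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    N = len(x)
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    return H @ K @ H

def kcca_contrast(components, sigma=1.0, kappa=2e-2):
    """C(W) = -1/2 log(lambda_min) of the Gram-based generalized eigenproblem."""
    m, N = len(components), len(components[0])
    K = [centered_gaussian_gram(np.asarray(x), sigma) for x in components]
    R = [Ki + (N * kappa / 2.0) * np.eye(N) for Ki in K]   # regularized diagonal blocks
    KK = np.block([[R[i] @ R[i] if i == j else K[i] @ K[j] for j in range(m)]
                   for i in range(m)])
    DD = np.block([[R[i] @ R[i] if i == j else np.zeros((N, N)) for j in range(m)]
                   for i in range(m)])
    lam_min = eigh(KK, DD, eigvals_only=True)[0]   # smallest generalized eigenvalue
    return -0.5 * np.log(max(lam_min, 1e-12))

# Tiny usage example: independent samples should give a contrast close to zero.
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(200), rng.standard_normal(200)
print(kcca_contrast([x1, x2]))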

Page 17

KICA – Kernel ICA

The computational complexity of calculating the 'smallest' generalized eigenvalue of matrices of size mN × mN is O(N^3) for fixed m. (Note: the eigenvalues are not directly related to the entries of W.)

But because the spectrum of a Gaussian-kernel Gram matrix decays rapidly, the Gram matrices can be replaced by low-rank approximations, reducing the cost to O(M^2 N), where M is a constant smaller than N.
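To illustrate the low-rank idea only (the paper itself relies on incomplete Cholesky decomposition rather than a full eigendecomposition), a rank-M factorization K ≈ G Gᵀ of a Gram matrix might look like this:

import numpy as np

def low_rank_factor(K, M):
    """Rank-M factor G with K approximately equal to G @ G.T (symmetric PSD K)."""
    vals, vecs = np.linalg.eigh(K)                 # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:M]               # keep the M largest
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))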

Page 18

KICA – Kernel ICA

The next crucial job is to find the W that minimizes C(W); that W is called the de-mixing matrix.

Preferably the data is whitened (via PCA) and W is restricted to be orthogonal, because independent sources are in particular uncorrelated, so after whitening the de-mixing matrix can be taken to be orthogonal.

The search for W in this restricted space (a Stiefel manifold) can be done with respect to a Riemannian metric, suggesting gradient-type algorithms.
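One common way to realize such a gradient-type update while keeping W orthogonal is sketched below; grad_C is a hypothetical routine returning the Euclidean gradient of C(W), and the step size eta is arbitrary, so this is an illustration of the idea rather than the paper's exact algorithm.

import numpy as np
from scipy.linalg import expm

def orthogonal_gradient_step(W, grad_C, eta=0.1):
    """One orthogonality-preserving descent step on the contrast C(W)."""
    G = grad_C(W)                       # Euclidean gradient dC/dW at W
    A = G @ W.T - W @ G.T               # skew-symmetric (tangent) direction
    return expm(-eta * A) @ W           # rotation update: W stays orthogonal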

Page 19

KICA – Kernel ICA

The problem of local minima can be mitigated by using heuristics (instead of random choices) for selecting the initial W.

It has also been shown empirically that a modest number of restarts solves this problem when a large number of samples is available.

Page 20

KGV – Kernel Generalized Variance

The KCCA contrast uses only the 'smallest' generalized eigenvalue of the Gram-based eigenproblem (for two variables this eigenvalue is 1 − ρ_F, so it carries the same information as the F-correlation).

The idea of KGV is to make use of the other eigenvalues as well.

The mutual information contrast function is defined as I_{KGV} = -\tfrac{1}{2}\,\log\big(\det \mathcal{K}_\kappa / \det \mathcal{D}_\kappa\big), where \mathcal{K}_\kappa is the regularized Gram-based correlation matrix of all m components and \mathcal{D}_\kappa is its block-diagonal part.
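The name "generalized variance" comes from the Gaussian case: for jointly Gaussian variables with covariance matrix C and block-diagonal part D, the mutual information depends only on the ratio of determinants,

I(x_1, \dots, x_m) \;=\; -\tfrac{1}{2}\,\log \frac{\det C}{\det D},

and the KGV contrast replaces C and D by their Gram-based (kernelized) counterparts \mathcal{K}_\kappa and \mathcal{D}_\kappa.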

Page 21

Simulation Results

The results on simulated data showed that KICA performs better than other ICA algorithms such as FastICA, JADE, and Imax (extended Infomax), especially for larger numbers of components.

The simulated data were mixtures of a variety of source distributions: subgaussian, supergaussian, and nearly Gaussian.

KICA is also robust to outliers.

Page 22

Simulation Results

Page 23

Conclusions

This paper proposed novel kernel-based measures for independence.

The approach is flexible but more computationally demanding than classical ICA algorithms, because of the additional eigenvalue computations involved in evaluating the contrast function.

Page 24

Questions!!