Finding Local Linear Correlations in High Dimensional Data

Page 1: Finding Local Linear Correlations in High Dimensional Data

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Finding Local Linear Correlations in High Dimensional Data

Xiang Zhang, Feng Pan, Wei Wang

University of North Carolina at Chapel Hill

Speaker: Xiang Zhang

Page 2: Finding Local Linear Correlations in High Dimensional Data

Finding Latent Patterns in High Dimensional Data

• An important research problem with wide applications: biology (gene expression analysis), customer transactions, and so on

• Common approaches: feature selection, feature transformation, subspace clustering

Page 3: Finding Local Linear Correlations in High Dimensional Data

Existing Approaches

• Feature selection: find a single representative subset of features that are most relevant for the data mining task at hand

• Feature transformation: find a set of new (transformed) features that retain as much of the information in the original data as possible, e.g. Principal Component Analysis (PCA)

• Correlation clustering: find clusters of data points that may not exist in the axis-parallel subspaces but only exist in the projected subspaces

Page 4: Finding Local Linear Correlations in High Dimensional Data

Motivation Example

$2x_2 + 6x_7 + 3x_9 = 0$

$x_1 + 3x_5 + 2x_6 + 5x_8 = 0$

Question: How to find these local linear correlations (using existing methods)?

[Figure: linearly correlated genes]

Page 5: Finding Local Linear Correlations in High Dimensional Data

Applying PCA: Correlated?

• PCA is an effective way to determine whether a set of features is strongly correlated

• A global transformation applied to the entire dataset

• A few eigenvectors describe most of the variance in the dataset

• Only a small amount of variance is represented by the remaining eigenvectors

• Small residual variance indicates strong correlation
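The residual-variance idea on this slide can be illustrated with a minimal sketch. The data below are hypothetical (three features, the third being nearly a linear combination of the first two), and `variance_spectrum` is an illustrative helper name, not something taken from the talk.

```python
import numpy as np

def variance_spectrum(X):
    """Eigenvalues of the covariance matrix of X (rows = points), largest first."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)   # ascending order for symmetric matrices
    return eigvals[::-1]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Hypothetical data: the third feature is almost a + b.
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=500)])

eigvals = variance_spectrum(X)
explained = eigvals / eigvals.sum()     # share of variance per eigenvector
residual = explained[-1]                # share left on the last eigenvector
print(explained, residual)
```

A residual share close to zero suggests that the three features lie near a two-dimensional plane, which is the sense of "strongly correlated" used here.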

Page 6: Finding Local Linear Correlations in High Dimensional Data

Applying PCA: Representation?

• The linear correlation is represented as the hyperplane that is orthogonal to the eigenvectors with the minimum variances

$x_1 - x_2 + x_3 = 0$

[1, -1, 1]

Embedded linear correlations:

$2x_2 + 6x_7 + 3x_9 = 0$

$x_1 + 3x_5 + 2x_6 + 5x_8 = 0$

[Figure: linear correlations reestablished by full-dimensional PCA]
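As a rough illustration of the hyperplane representation, the sketch below builds hypothetical data that approximately satisfy $x_1 - x_2 + x_3 = 0$ and reads the hyperplane normal off the eigenvector with the smallest eigenvalue. The noise level and sample size are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=400)
x2 = rng.normal(size=400)
# Hypothetical data: x3 is chosen so that x1 - x2 + x3 is close to 0.
x3 = x2 - x1 + 0.01 * rng.normal(size=400)
X = np.column_stack([x1, x2, x3])

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
normal = eigvecs[:, 0]          # eigenvector with the minimum variance
normal = normal / normal[0]     # rescale so the first coefficient is 1
print(np.round(normal, 2))
```

The rescaled vector should be close to [1, -1, 1], the coefficients of the hyperplane equation.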

Page 7: Finding Local Linear Correlations in High Dimensional Data

Applying Bi-clustering or Correlation Clustering Methods

• Correlation clustering: no obvious clustering structure

• Bi-clustering: no strong pair-wise correlations

[Figure: linearly correlated genes]

Page 8: Finding Local Linear Correlations in High Dimensional Data

Revisiting Existing Work

• Feature selection: finds only one representative subset of features

• Feature transformation:
  - performs one and the same feature transformation for the entire dataset
  - does not really eliminate the impact of any original attributes

• Correlation clustering: projected subspaces are usually found by applying a standard feature transformation method, such as PCA

Page 9: Finding Local Linear Correlations in High Dimensional Data

Local Linear Correlations - formalization

• Idea: formalize local linear correlations as strongly correlated feature subsets
  - Determining if a feature subset is correlated: small residual variance
  - The correlation may not be supported by all data points (noise, domain knowledge…): it should be supported by a large portion of the data points

Page 10: Finding Local Linear Correlations in High Dimensional Data

Problem Formalization

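• Suppose F (m by n) is a submatrix of the dataset D (M by N)

• Let $\{\lambda_i\}$ ($1 \le i \le n$) be the eigenvalues of the covariance matrix of F, arranged in ascending order

• F is a strongly correlated feature subset if

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{n} \lambda_j} \le \varepsilon \quad (1) \qquad \text{and} \qquad \frac{m}{M} \ge \delta \quad (2)$$

  - numerator of (1): variance on the k eigenvectors having the smallest eigenvalues (residual variance)
  - denominator of (1): total variance
  - m: number of supporting data points
  - M: total number of data points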
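The two conditions can be checked directly with a few lines of NumPy. This is a minimal sketch under stated assumptions: `residual_ratio` and `is_strongly_correlated` are illustrative names, `eps` and `delta` stand for the thresholds ε and δ, and the dataset `D` is synthetic.

```python
import numpy as np

def residual_ratio(F, k):
    """Variance on the k smallest eigenvalues divided by the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending order
    return eigvals[:k].sum() / eigvals.sum()

def is_strongly_correlated(F, M, k, eps, delta):
    """Condition (1): small residual variance. Condition (2): enough supporting points."""
    m = F.shape[0]
    return bool(residual_ratio(F, k) <= eps and m / M >= delta)

rng = np.random.default_rng(3)
D = rng.normal(size=(100, 4))                    # hypothetical dataset, M = 100, N = 4
# Embed a correlation among features 0, 1, 2 on the first 80 points only.
D[:80, 2] = D[:80, 0] + D[:80, 1] + 0.01 * rng.normal(size=80)

F = D[:80, :3]                                   # 80 points, 3 features
print(is_strongly_correlated(F, M=100, k=1, eps=0.01, delta=0.5))
print(is_strongly_correlated(D[:10, :3], M=100, k=1, eps=0.01, delta=0.5))
```

The second call uses only 10 of the 100 points, so it fails condition (2) regardless of how small its residual variance is.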

Page 11: Finding Local Linear Correlations in High Dimensional Data

Problem Formalization

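• Suppose F (m by n) is a submatrix of the dataset D (M by N), and $\lambda_1 \le \dots \le \lambda_n$ are the eigenvalues of the covariance matrix of F

$$f(F, k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{n} \lambda_j}$$

  - larger k, stronger correlation
  - smaller ε, stronger correlation
  - k and ε together control the strength of the correlation

[Figure: eigenvalues plotted against eigenvalue id, annotated "larger k" and "smaller ε"]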

Page 12: Finding Local Linear Correlations in High Dimensional Data

Goal

• Goal: to find all strongly correlated feature subsets

• Enumerate all sub-matrices?
  - Not feasible ($2^{M \times N}$ sub-matrices in total)
  - Efficient algorithm needed

• Any property we can use?
  - Monotonicity of the objective function

Page 13: Finding Local Linear Correlations in High Dimensional Data

Monotonicity

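• Monotonic w.r.t. the feature subset
  - If a feature subset is strongly correlated, all its supersets are also strongly correlated
  - Derived from the Interlacing Eigenvalue Theorem:

$$\lambda'_1 \le \lambda_1 \le \lambda'_2 \le \lambda_2 \le \dots \le \lambda'_n \le \lambda_n \le \lambda'_{n+1}$$

    where $\lambda_i$ are the eigenvalues for a subset of n features and $\lambda'_i$ are the eigenvalues after one more feature is added
  - Allows us to focus on finding the smallest feature subsets that are strongly correlated
  - Enables an efficient algorithm: no exhaustive enumeration needed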
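The interlacing property and its consequence for the residual-variance ratio can be checked numerically. The sketch below uses hypothetical random data with five features; the covariance matrix of the first four features is a principal submatrix of the full covariance matrix, which is the setting in which the interlacing theorem applies.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 200 points, 5 mutually correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

sub = np.linalg.eigvalsh(np.cov(X[:, :4], rowvar=False))   # lambda_1..lambda_n, n = 4
sup = np.linalg.eigvalsh(np.cov(X, rowvar=False))          # lambda'_1..lambda'_{n+1}

tol = 1e-9   # slack for floating-point round-off
interlaced = all(
    sup[i] <= sub[i] + tol and sub[i] <= sup[i + 1] + tol for i in range(4)
)

k = 2
f_sub = sub[:k].sum() / sub.sum()   # ratio for the 4-feature subset
f_sup = sup[:k].sum() / sup.sum()   # ratio after adding the fifth feature
print(interlaced, f_sub, f_sup)
```

Because each $\lambda'_i \le \lambda_i$ and the total variance cannot shrink when a feature is added, the ratio for the superset is never larger than the ratio for the subset, which is why a strongly correlated subset stays strongly correlated in all supersets.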

Page 14: Finding Local Linear Correlations in High Dimensional Data

The CARE Algorithm

• Selecting the feature subsets
  - Enumerate feature subsets from smaller size to larger size (DFS or BFS)
  - If a feature subset is strongly correlated, then its supersets are pruned (monotonicity of the objective function)
  - Further pruning possible (refer to paper for details)
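The enumeration with superset pruning can be sketched as a level-wise (BFS-style) search. This is a simplified illustration, not the CARE implementation: it evaluates every candidate on all data points, omits the further pruning mentioned on the slide, and starts at size k + 1 as an assumption (with only k features the ratio is trivially 1). The function names and the synthetic data are hypothetical.

```python
from itertools import combinations
import numpy as np

def residual_ratio(F, k):
    """Variance on the k smallest eigenvalues divided by the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending order
    return eigvals[:k].sum() / eigvals.sum()

def minimal_correlated_subsets(D, k, eps, max_size):
    """Enumerate feature subsets smallest first; skip supersets of reported subsets."""
    found = []
    for size in range(k + 1, max_size + 1):
        for cols in combinations(range(D.shape[1]), size):
            if any(set(s) <= set(cols) for s in found):
                continue                       # pruned by monotonicity
            if residual_ratio(D[:, list(cols)], k) <= eps:
                found.append(cols)
    return found

rng = np.random.default_rng(5)
D = rng.normal(size=(300, 5))                  # hypothetical dataset
D[:, 2] = D[:, 0] + D[:, 1] + 0.01 * rng.normal(size=300)

result = minimal_correlated_subsets(D, k=1, eps=0.01, max_size=4)
print(result)
```

Only the smallest strongly correlated subsets are reported; every superset of a reported subset is skipped without computing any eigenvalues.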

Page 15: Finding Local Linear Correlations in High Dimensional Data

Monotonicity

• Non-monotonic w.r.t. the point subset
  - Adding a point to (or deleting a point from) a feature subset can increase or decrease the correlation among the features
  - Exhaustive enumeration infeasible: effective heuristic needed

Page 16: Finding Local Linear Correlations in High Dimensional Data

The CARE Algorithm

• Selecting the point subsets
  - A feature subset may only be correlated on a subset of data points
  - If a feature subset is not strongly correlated on all data points, how to choose the proper point subset?

Page 17: Finding Local Linear Correlations in High Dimensional Data

The CARE Algorithm

• Successive point deletion heuristic
  - Greedy algorithm: in each iteration, delete the point whose removal results in the maximum increase of the correlation among the subset of features
  - Inefficient: need to evaluate the objective function for all data points
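The greedy heuristic can be sketched as follows. This is an illustrative reading of the slide, assuming that "maximum increase of the correlation" means the largest drop in the residual-variance ratio; the function names and the synthetic data (40 points near a plane plus 5 scattered points) are hypothetical.

```python
import numpy as np

def residual_ratio(F, k):
    """Variance on the k smallest eigenvalues divided by the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending order
    return eigvals[:k].sum() / eigvals.sum()

def successive_deletion(F, k, n_delete):
    """Repeatedly drop the single point whose removal gives the smallest ratio."""
    keep = list(range(F.shape[0]))
    for _ in range(n_delete):
        # One objective evaluation per remaining point: this is the costly part.
        scores = [
            residual_ratio(F[keep[:i] + keep[i + 1:]], k) for i in range(len(keep))
        ]
        del keep[int(np.argmin(scores))]
    return keep

rng = np.random.default_rng(4)
a, b = rng.normal(size=40), rng.normal(size=40)
inliers = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=40)])
outliers = 3.0 * rng.normal(size=(5, 3))       # points off the plane
F = np.vstack([inliers, outliers])

f_before = residual_ratio(F, 1)
keep = successive_deletion(F, k=1, n_delete=5)
f_after = residual_ratio(F[keep], 1)
print(f_before, f_after)
```

Each iteration recomputes the eigenvalues once per remaining point, which illustrates why the slide calls this heuristic inefficient.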

Page 18: Finding Local Linear Correlations in High Dimensional Data

The CARE Algorithm

• Distance-based point deletion heuristic
  - Let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues
  - Let S2 be the subspace spanned by the remaining n-k eigenvectors
  - Intuition: try to reduce the variance in S1 as much as possible while retaining the variance in S2
  - Directly delete (1-δ)M points having large variance in S1 and small variance in S2 (refer to paper for details)
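A one-pass version of this idea might look like the sketch below. The slide defers the exact ranking rule to the paper, so the score used here (the share of a point's squared deviation that falls in S1) is an assumption made for illustration, as are the function names and the synthetic data.

```python
import numpy as np

def residual_ratio(F, k):
    """Variance on the k smallest eigenvalues divided by the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending order
    return eigvals[:k].sum() / eigvals.sum()

def distance_based_deletion(F, k, delta):
    """Keep round(delta * M) points; delete those dominated by their S1 component."""
    M = F.shape[0]
    centered = F - F.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))
    d1 = ((centered @ eigvecs[:, :k]) ** 2).sum(axis=1)   # squared length in S1
    d2 = ((centered @ eigvecs[:, k:]) ** 2).sum(axis=1)   # squared length in S2
    score = d1 / (d1 + d2 + 1e-12)    # assumed ranking rule, not taken from the paper
    n_keep = int(round(delta * M))
    return np.sort(np.argsort(score)[:n_keep])

rng = np.random.default_rng(6)
a, b = rng.normal(size=150), rng.normal(size=150)
inliers = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=150)])
outliers = rng.normal(size=(50, 3))            # points not on the plane
F = np.vstack([inliers, outliers])

f_before = residual_ratio(F, 1)
kept = distance_based_deletion(F, k=1, delta=0.75)
f_after = residual_ratio(F[kept], 1)
print(f_before, f_after, len(kept))
```

Unlike the successive heuristic, this computes one eigendecomposition and one score per point, then deletes all (1-δ)M points at once.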

Page 19: Finding Local Linear Correlations in High Dimensional Data

The CARE Algorithm

A comparison between the two point deletion heuristics

[Figure panels: successive, distance-based]

Page 20: Finding Local Linear Correlations in High Dimensional Data

Experimental Results (Synthetic)

[Figure: linear correlation embedded and linear correlation reestablished; panels: full-dimensional PCA, CARE]

Page 21: Finding Local Linear Correlations in High Dimensional Data

Experimental Results (Synthetic)

[Figure panels: pair-wise correlations; linear correlation embedded (hyperplane representation)]

Page 22: Finding Local Linear Correlations in High Dimensional Data

Experimental Results (Synthetic)

[Figure: scalability evaluation]

Page 23: Finding Local Linear Correlations in High Dimensional Data

Experimental Results (Wage)

A comparison between the correlation clustering method and CARE (dataset: 534×11, http://lib.stat.cmu.edu/datasets/CPS_85_Wages)

Correlation clustering method & CARE: $YE + YW - A = -6$

CARE only: [equation over W, YW and A; its coefficients survive only as "805.425.4"]

Page 24: Finding Local Linear Correlations in High Dimensional Data

Experimental Results

Linearly correlated genes (hyperplane representations) (220 genes for 42 mouse strains)

Nrg4: cell part
Myh7: cell part; intracellular part
Hist1h2bk: cell part; intracellular part
Arntl: cell part; intracellular part

Nrg4: integral to membrane
Olfr281: integral to membrane
Slco1a1: integral to membrane
P196867: N/A

Oazin: catalytic activity
Ctse: catalytic activity
Mgst3: catalytic activity

Hspb2: cellular physiological process
2810453L12Rik: cellular physiological process
1010001D01Rik: cellular physiological process
P213651: N/A

Ldb3: intracellular part
Sec61g: intracellular part
Exosc4: intracellular part
BC048403: N/A

Mgst3: catalytic activity; intracellular part
Nr1d2: intracellular part; metal ion binding
Ctse: catalytic activity
Pgm3: metal ion binding

Hspb2: cellular metabolism
Sec61b: cellular metabolism
Gucy2g: cellular metabolism
Sdh1: cellular metabolism

Ptk6: membrane
Gucy2g: integral to membrane
Clec2g: integral to membrane
H2-Q2: integral to membrane

Page 25: Finding Local Linear Correlations in High Dimensional Data

Thank You!

Questions?