Sparsity-Cognizant Overlapping Co-clustering
Transcript of Sparsity-Cognizant Overlapping Co-clustering
Sparsity-Cognizant Overlapping Co-clustering
March 11, 2010
Hao Zhu, Dept. of ECE, Univ. of Minnesota
http://spincom.ece.umn.edu
Acknowledgements: G. Mateos, Profs. G. B. Giannakis, N. D. Sidiropoulos, A. Banerjee, and G. Leus
NSF grants CCF 0830480 and CON 014658; ARL/CTA grant no. DAAD19-01-2-0011
2 SPinCOM University of Minnesota
Outline
Motivation and context
Problem statement and plaid models
Sparsity-cognizant overlapping co-clustering (SOC)
Uniqueness
Simulated tests
Conclusions and future research
Context
Dense, approximately constant-valued submatrices (figure: data matrix, objects × attributes)
Co-clustering (biclustering) = two-way clustering
Clustering: partition the objects (samples, rows) based on a similarity criterion on their attributes (features, columns) [Tan-Steinbach-Kumar ’06]
Co-clustering: simultaneous clustering of objects and attributes [Busygin et al ’08]
NP-hard in general; often reduced to ordinary clustering, as in k-means
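As a baseline illustration of the k-means reduction, the rows and columns can be clustered separately and the resulting groups crossed into co-clusters; a minimal numpy sketch (the deterministic extreme-point initialization is a simplification for reproducibility):

```python
import numpy as np

def kmeans_labels(X, iters=10):
    # plain Lloyd's algorithm (k = 2) on the rows of X; the deterministic
    # extreme-point initialization keeps this sketch reproducible
    centers = np.stack([X.min(axis=0), X.max(axis=0)])
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy matrix with one dense block: rows 0-2 x columns 0-2
Y = np.zeros((6, 6))
Y[:3, :3] = 5.0
row_labels = kmeans_labels(Y)      # cluster objects (rows)
col_labels = kmeans_labels(Y.T)    # cluster attributes (columns)
# a co-cluster is a (row cluster, column cluster) pair of index sets
```

Each (row cluster, column cluster) pair defines one candidate submatrix, which is exactly the two-way structure co-clustering seeks directly.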
Context
Application areas
Social networks: cohesive subgroups of actors within a network [Wasserman et al ’94]
Internet traffic: dominant host groups with strong interactions [Jin et al ’09]
Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al ’02]
Related Work and Our Focus
Matrix factorization based on SVD
Bipartite spectral graph partitioning [Dhillon ’01]
Orthogonal nonnegative matrix factorization (tNMF) [Ding et al ’06]
Search for non-overlapping co-clusters using orthogonality
Probabilistic model for co-cluster membership indicators and parameters
Overlapping co-clustering under a Bayesian framework [Fu-Banerjee ’09]
EM algorithm for inference and parameter estimation
E-step: Gibbs sampling for membership indicator detection
Plaid models [Lazzeroni et al ’02]
Superposition of multiple overlapping layers (co-clusters)
Greedy layer search: one at a time
Related Work and Our Focus (cont’d)
Borrow plaid model features
Sparse information hidden in large data set
Our focus: Sparsity-cognizant overlapping co-clustering (SOC) algorithm
Overlapping: some objects/attributes may relate to multiple co-clusters
Exploit sparsity in co-cluster membership vectors
Partial co-clustering motivated by “Uninteresting background”
Parsimonious models: more interpretable and informative
Linear model incurs a low computational burden
Simultaneous cross-layer optimization
Compared with greedy layer-by-layer strategy
Modeling Matrix Y: n × p
Ex-1: Internet traffic: traffic activity graph (TAG)1
induced by two groups of interacting nodes and
Yij measures the strength of the relationship between and
Track the traffic flow between inside host and outside host
Ex-2: Gene expression microarray data2
Measures the level at which a gene is expressed in a sample
1,2 The two pictures are taken from [Jin et al ’09] and [Lazzeroni et al ’02], respectively.
Submatrices
Hidden dense/uniform submatrices
capture a subset of objects that has similar feature values over a subset of attributes
reveal certain informative behavioral patterns
Features
distributed “sparsely” in Y compared to the data dimension np
may overlap because of some multi-functional nodes
Goal: Efficient co-clustering algorithms to extract the underlying submatrices, by exploiting sparsity and accounting for possible overlapping
Plaid Models
Matrix Y: superposition of k submatrices (layers)

Y_ij = θ_ij0 + Σ_{l=1}^{k} θ_ijl ρ_il κ_jl,  θ_ijl = μ_l + α_il + β_jl

μ_l: level of layer l, common to all the nodes in the layer (θ_ij0: background)
ρ_il (κ_jl) = 1 if row i (column j) is in the l-th layer, 0 otherwise
Row/column-related effects α_il and β_jl express the node-related response
Problem Statement
Problem: Given the plaid model, seek the optimal membership indicators
Efficient sub-optimal algorithm to identify the submatrices jointly (recall that the submatrices are detected one at a time in [Lazzeroni et al ’02])
Data fitting error penalized by the L1 norm of the indicators
Facilitates extraction of the more informative/interpretable submatrices out of Y
Finding the optimal solution is NP-hard: binary constraints on membership indicators [Turner et al ’05] and products of different variables
λ > 0 controls the sparsity enforced
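As a sketch of the penalized criterion, assuming per-layer constant levels θ_l (a simplification of the full plaid parameterization, which adds row/column effects):

```python
import numpy as np

def soc_cost(Z, theta, rho, kappa, lam):
    # data-fitting error plus the L1 penalty on the membership indicators:
    #   || Z - sum_l theta_l * rho_l kappa_l^T ||_F^2
    #     + lam * (||rho||_1 + ||kappa||_1)
    k = rho.shape[1]
    fit = Z - sum(theta[l] * np.outer(rho[:, l], kappa[:, l]) for l in range(k))
    # for binary indicators the L1 norm simply counts the ones
    return (fit ** 2).sum() + lam * (np.abs(rho).sum() + np.abs(kappa).sum())

# one exact layer: zero fitting error, penalty counts the active indicators
r = np.array([[1.0], [1.0], [0.0]])
c = np.array([[1.0], [0.0]])
Z = 3.0 * np.outer(r[:, 0], c[:, 0])
```

With a perfect fit the cost reduces to λ times the number of active indicators, making explicit how λ trades fitting error against sparsity.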
Sparsity-cognizant overlapping co-clustering (SOC)
Different from [Lazzeroni et al ’02]
Per iteration s, θ(s) collects all θ_ijl(s) values; likewise for ρ(s) and κ(s)
Background-layer-free residue matrix Z
Iterative cycling update of θ, ρ, and κ: ρ(s-1), κ(s-1) → θ(s); then θ(s), κ(s-1) → ρ(s); then θ(s), ρ(s) → κ(s)
All the k layers are updated jointly, less prone to error propagation across layers
Membership indicators are updated under binary constraints, incurring combinatorial complexity
Updating θ(s)
Given ρ(s-1) and κ(s-1): unconstrained quadratic program with a closed-form solution
But inversion of a large matrix leads to numerical instability
Instead, coordinate descent algorithm alternating across all the layers
For l = 1, ..., k: define the residue matrix Z_l; reduce to the submatrix formed by the rows with ρ_il(s-1) = 1 and the columns with κ_jl(s-1) = 1
Update for T cycles (T small)
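For per-layer constant levels (again a simplification of the full plaid parameterization θ_ijl = μ_l + α_il + β_jl), the closed-form coordinate update reduces to averaging the layer-l residual over the active entries; a minimal sketch:

```python
import numpy as np

def update_theta_l(Z, theta, rho, kappa, l):
    # coordinate-descent step for layer l: subtract every other layer,
    # then the unconstrained LS solution for theta_l is the mean of the
    # residual over the entries with rho_il = kappa_jl = 1
    k = rho.shape[1]
    Z_l = Z - sum(theta[m] * np.outer(rho[:, m], kappa[:, m])
                  for m in range(k) if m != l)
    rows, cols = rho[:, l] == 1, kappa[:, l] == 1
    return Z_l[np.ix_(rows, cols)].mean()

# two overlapping layers with true levels 4 and 2
rho = np.array([[1.0, 0], [1, 1], [0, 1], [0, 0]])
kappa = np.array([[1.0, 0], [1, 1], [0, 1]])
Z = 4 * np.outer(rho[:, 0], kappa[:, 0]) + 2 * np.outer(rho[:, 1], kappa[:, 1])
theta = np.array([0.0, 2.0])   # layer-0 level starts wrong
```

Cycling this step over l = 1, ..., k for T cycles gives the alternating layer updates described above.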
Updating ρ(s) and κ(s)
L1 norm penalty reduces to linear term due to non-negativity
Important for overlapping submatrices to eliminate cross effects
Given θ(s) and κ(s-1), determine ρ(s)
Jointly obtain the membership indicators for the i-th row
Quadratic minimization subject to {0,1} binary constraints: NP-hard
Similar problems in MIMO/multiuser detection with binary alphabet
(Near-) optimal sphere decoding algorithm (SDA)
Incurs polynomial (cubic) complexity in general
Same technique detects κ(s) given θ(s) and ρ(s)
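The sphere decoder itself is beyond a slide-sized sketch, but for small k the same binary least-squares detection can be done by exhaustive enumeration; this hypothetical stand-in again assumes per-layer constant levels θ_l:

```python
import numpy as np
from itertools import product

def detect_row(z_i, theta, kappa, lam):
    # enumerate all 2^k binary membership vectors for one row of Z and keep
    # the minimizer of the penalized quadratic cost; feasible for small k
    # (the sphere decoder performs this search far more efficiently)
    k = kappa.shape[1]
    best, best_rho = np.inf, None
    for bits in product([0, 1], repeat=k):
        rho_i = np.array(bits, dtype=float)
        fit = z_i - (kappa * theta) @ rho_i   # sum_l rho_il * theta_l * kappa_:l
        cost = (fit ** 2).sum() + lam * rho_i.sum()
        if cost < best:
            best, best_rho = cost, rho_i
    return best_rho

# row generated by layer 1 only (level 2) should be detected as [1, 0]
kappa = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = np.array([2.0, 3.0])
z = np.array([2.0, 2.0, 0.0])
```

Note also how the L1 penalty enters as the linear term lam * rho_i.sum(), as the non-negativity remark above indicates.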
Convergence and Implementation
Data fitting error cost: bounded below and non-increasing per iteration
SOC algorithm converges (at least) to a stationary point
Initialization
Background level fitting to obtain matrix Z (recall the submatrix parameter fitting)
Parameter choices
Number of layers k : explain a certain percentage of variation
Pruning steps [Lazzeroni et al ’02], [Turner et al ’05]
Membership indicators ρ(0) and κ(0): k-means [Turner et al ’05]
Sparsity regularization parameter λ: trial-and-error/bi-cross-validation [Witten et al ’09]
Uniqueness
Plain plaid models: decomposition Z = R D K^T into a product of unknown matrices
Binary-valued matrices: R = [ρ_il] (n × k) and K = [κ_jl] (p × k)
Diagonal matrix D = diag(μ_1, ..., μ_k)
Blind source separation (BSS) [Talwar et al ’96], [van der Veen et al ’96]
Product of two matrices with finite alphabet (FA)/constant modulus (CM) constraints: (generally) uniquely identifiable with a sufficient number of samples
3-way array (Candecomp/Parafac) [Kruskal ’77], [Sidiropoulos et al ’00]: unique up to permutation and scaling
Fails to hold in the two-way case (third dimension h = 1)
Uniqueness (cont’d): Sparsity in blind identification
Sparse component analysis: very sparse representation [Georgiev et al ’07]
Non-negative source separation using local dominance [Chan et al ’08]
Proposition: Consider Z = R D K^T, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank. Each column vector k_l of K is locally sparse for all l,
which means there exists an (unknown) row index j_l such that row j_l of K belongs to layer l only.
Then, given Z, the matrices R, D, and K are unique up to permutations.
Proof relies on convex analysis: the affine hull of the column vectors of K coincides with that of Z
Under local sparseness, its convex hull becomes the intersection of the affine hull and the positive orthant
Columns of K are extreme points of the convex hull
Results hold also when R is locally sparse (symmetry)
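The local sparseness condition is easy to check for a given binary membership matrix; a small sketch (the example matrices K_good and K_bad are illustrative):

```python
import numpy as np

def locally_sparse(K):
    # K: p x k binary membership matrix, columns = layers. Every layer l must
    # have some row j_l whose membership selects layer l only (row sum = 1).
    k = K.shape[1]
    return all(np.any((K[:, l] == 1) & (K.sum(axis=1) == 1)) for l in range(k))

K_good = np.array([[1, 0], [1, 1], [0, 1]])   # rows 0 and 2 are exclusive
K_bad  = np.array([[1, 1], [1, 1], [0, 1]])   # layer 0 has no exclusive row
```

By the proposition, a decomposition whose K (or, by symmetry, R) passes this check is unique up to permutations.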
Preliminary Simulation
Two uniform blocks + noise ~ Unif[0, 0.5]
SOC parameters: k = 2, S = 20, T = 1, and λ = 0, 3
(Figure: Original and Permuted data, and recovered co-clusters for λ = 0, λ = 3, and the Plaid method.)
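A data set like the one on this slide can be generated as follows (block sizes, positions, and levels are assumptions; the slide does not state them):

```python
import numpy as np

# two uniform blocks plus Unif[0, 0.5] noise, then a random row/column
# permutation: the "Permuted" panel is what the algorithm actually sees
rng = np.random.default_rng(0)
Y = rng.uniform(0.0, 0.5, size=(20, 20))   # background noise
Y[:8, :8] += 3.0                           # block 1 (size/level assumed)
Y[12:, 12:] += 5.0                         # block 2 (size/level assumed)
perm_r, perm_c = rng.permutation(20), rng.permutation(20)
Y_perm = Y[perm_r][:, perm_c]
```

Permuting rows and columns hides the block structure from the eye without changing it, so a co-clustering algorithm must recover the blocks from Y_perm.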
Real Data Tests
Internet traffic flow data
Uncover different types of co-clusters: in-star, out-star, bi-mesh, ...
Email application examples: department servers, Gmail
Overlapping co-clusters may reveal server farms
Gene expression microarray data
Co-clusters may exhibit some biological patterns
Need to check with the gene enrichment value
Concluding Summary
Plaid models to reveal overlapping co-clusters
Exploit sparsity for parsimonious recovery
Jointly decide among multiple layers
SOC algorithm iteratively updates the unknown parameters
Coordinate descent solver for the layer-level parameters
Sphere decoder detects membership indicators jointly
Local sparseness leads to unique decomposition
Future Directions
Implementation issues with parameter choices
Efficient initializations and membership vector detection
Comprehensive numerical experiments on real data
Thank You!