Sparsity-Cognizant Overlapping Co-clustering
Transcript of Sparsity-Cognizant Overlapping Co-clustering
Sparsity-Cognizant Overlapping Co-clustering
March 11, 2010
Hao Zhu, Dept. of ECE, Univ. of Minnesota
http://spincom.ece.umn.edu
Acknowledgements: G. Mateos, Profs. G. B. Giannakis, N. D. Sidiropoulos, A. Banerjee, and G. Leus
NSF grants CCF 0830480 and CON 014658; ARL/CTA grant no. DAAD19-01-2-0011
2 SPinCOM University of Minnesota
Outline
Motivation and context
Problem statement and plaid models
Sparsity-cognizant overlapping co-clustering (SOC)
Uniqueness
Simulated tests
Conclusions and future research
Context
Dense, approximately constant-valued submatrices (figure: data matrix, objects × attributes)
Co-clustering (biclustering) = two-way clustering
Clustering: partition the objects (samples, rows) based on a similarity criterion on their attributes (features, columns) [Tan-Steinbach-Kumar ’06]
Co-clustering: simultaneous clustering of objects and attributes [Busygin et al ’08]
NP-hard in general; often reduced to ordinary clustering, as in k-means
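As a baseline illustration of the k-means reduction, the rows and columns can be clustered separately and the resulting groups crossed into co-clusters; a minimal numpy sketch (the deterministic extreme-point initialization is a simplification for reproducibility):

```python
import numpy as np

def kmeans_labels(X, iters=10):
    # plain Lloyd's algorithm (k = 2) on the rows of X; the deterministic
    # extreme-point initialization keeps this sketch reproducible
    centers = np.stack([X.min(axis=0), X.max(axis=0)])
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy matrix with one dense block: rows 0-2 x columns 0-2
Y = np.zeros((6, 6))
Y[:3, :3] = 5.0
row_labels = kmeans_labels(Y)      # cluster objects (rows)
col_labels = kmeans_labels(Y.T)    # cluster attributes (columns)
# a co-cluster is a (row cluster, column cluster) pair of index sets
```

Each (row cluster, column cluster) pair defines one candidate submatrix, which is exactly the two-way structure co-clustering seeks directly.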
Context
Application areas
Social networks: cohesive subgroups of actors within a network [Wasserman et al ’94]
Internet traffic: dominant host groups with strong interactions [Jin et al ’09]
Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al ’02]
Related Work and Our Focus
Matrix factorization based on SVD
Bipartite spectral graph partitioning [Dhillon ’01]
Orthogonal nonnegative matrix factorization (tNMF) [Ding et al ’06]
Search for non-overlapping co-clusters using orthogonality
Probabilistic model for co-cluster membership indicators and parameters
Overlapping co-clustering under a Bayesian framework [Fu-Banerjee ’09]
EM algorithm for inference and parameter estimation
E-step: Gibbs sampling for membership indicator detection
Plaid models [Lazzeroni et al ’02]
Superposition of multiple overlapping layers (co-clusters)
Greedy layer search: one at a time
Related Work and Our Focus (cont’d)
Borrow plaid model features
Sparse information hidden in large data set
Our focus: Sparsity-cognizant overlapping co-clustering (SOC) algorithm
Overlapping: some objects/attributes may relate to multiple co-clusters
Exploit sparsity in co-cluster membership vectors
Partial co-clustering motivated by “Uninteresting background”
Parsimonious models: more interpretable and informative
Linear model incurs a low computational burden
Simultaneous cross-layer optimization
Compared with greedy layer-by-layer strategy
Modeling Matrix Y: n × p
Ex-1: Internet traffic: traffic activity graph (TAG)1
induced by two groups of interacting nodes and
Yij measures the strength of the relationship between and
Track the traffic flow between inside host and outside host
Ex-2: Gene expression microarray data2
Measures the level at which a gene is expressed in a sample
1,2 The two pictures are taken from [Jin et al ’09] and [Lazzeroni et al ’02], respectively.
Submatrices
Hidden dense/uniform submatrices
capture a subset of objects that has similar feature values over a subset of attributes
reveal certain informative behavioral patterns
Features
distributed “sparsely” in Y compared to the data dimension np
may overlap because of some multi-functional nodes
Goal: Efficient co-clustering algorithms to extract the underlying submatrices, by exploiting sparsity and accounting for possible overlapping
Plaid Models
Matrix Y: superposition of k submatrices (layers)

Y_ij = θ_ij0 + Σ_{l=1}^{k} θ_ijl ρ_il κ_jl,  θ_ijl = μ_l + α_il + β_jl

μ_l: level of layer l, common to all the nodes in the layer (θ_ij0: background)
ρ_il (κ_jl) = 1 if row i (column j) is in the l-th layer, 0 otherwise
Row/column-related effects α_il and β_jl express the node-related response
Problem Statement
Problem: Given the plaid model, seek the optimal membership indicators
Efficient sub-optimal algorithm to identify the submatrices jointly (recall that the submatrices are detected one at a time in [Lazzeroni et al ’02])
Data fitting error penalized by the L1 norm of the indicators
Facilitates extraction of the more informative/interpretable submatrices out of Y
Finding the optimal solution is NP-hard: binary constraints on membership indicators [Turner et al ’05] and products of different variables
λ > 0 controls the sparsity enforced
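As a sketch of the penalized criterion, assuming per-layer constant levels θ_l (a simplification of the full plaid parameterization, which adds row/column effects):

```python
import numpy as np

def soc_cost(Z, theta, rho, kappa, lam):
    # data-fitting error plus the L1 penalty on the membership indicators:
    #   || Z - sum_l theta_l * rho_l kappa_l^T ||_F^2
    #     + lam * (||rho||_1 + ||kappa||_1)
    k = rho.shape[1]
    fit = Z - sum(theta[l] * np.outer(rho[:, l], kappa[:, l]) for l in range(k))
    # for binary indicators the L1 norm simply counts the ones
    return (fit ** 2).sum() + lam * (np.abs(rho).sum() + np.abs(kappa).sum())

# one exact layer: zero fitting error, penalty counts the active indicators
r = np.array([[1.0], [1.0], [0.0]])
c = np.array([[1.0], [0.0]])
Z = 3.0 * np.outer(r[:, 0], c[:, 0])
```

With a perfect fit the cost reduces to λ times the number of active indicators, making explicit how λ trades fitting error against sparsity.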
Sparsity-cognizant overlapping co-clustering (SOC)
Different from [Lazzeroni et al ’02]
Per iteration s, θ(s) collects all θ_ijl(s) values; likewise for ρ(s) and κ(s)
Background-layer-free residue matrix Z
Iterative cycling update of θ, ρ, and κ: ρ(s-1), κ(s-1) → θ(s); then θ(s), κ(s-1) → ρ(s); then θ(s), ρ(s) → κ(s)
All the k layers are updated jointly, less prone to error propagation across layers
Membership indicators are updated under binary constraints, incurring combinatorial complexity
Updating θ(s)
Given ρ(s-1) and κ(s-1): unconstrained quadratic program with a closed-form solution
But inversion of a large matrix leads to numerical instability
Instead, coordinate descent algorithm alternating across all the layers
For l = 1, ..., k: define the residue matrix Z_l; reduce to the submatrix formed by the rows with ρ_il(s-1) = 1 and the columns with κ_jl(s-1) = 1
Update for T cycles (T small)
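For per-layer constant levels (again a simplification of the full plaid parameterization θ_ijl = μ_l + α_il + β_jl), the closed-form coordinate update reduces to averaging the layer-l residual over the active entries; a minimal sketch:

```python
import numpy as np

def update_theta_l(Z, theta, rho, kappa, l):
    # coordinate-descent step for layer l: subtract every other layer,
    # then the unconstrained LS solution for theta_l is the mean of the
    # residual over the entries with rho_il = kappa_jl = 1
    k = rho.shape[1]
    Z_l = Z - sum(theta[m] * np.outer(rho[:, m], kappa[:, m])
                  for m in range(k) if m != l)
    rows, cols = rho[:, l] == 1, kappa[:, l] == 1
    return Z_l[np.ix_(rows, cols)].mean()

# two overlapping layers with true levels 4 and 2
rho = np.array([[1.0, 0], [1, 1], [0, 1], [0, 0]])
kappa = np.array([[1.0, 0], [1, 1], [0, 1]])
Z = 4 * np.outer(rho[:, 0], kappa[:, 0]) + 2 * np.outer(rho[:, 1], kappa[:, 1])
theta = np.array([0.0, 2.0])   # layer-0 level starts wrong
```

Cycling this step over l = 1, ..., k for T cycles gives the alternating layer updates described above.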
Updating ρ(s) and κ(s)
L1 norm penalty reduces to linear term due to non-negativity
Important for overlapping submatrices to eliminate cross effects
Given θ(s) and κ(s-1), determine ρ(s)
Jointly obtain the membership indicators for the i-th row
Quadratic minimization subject to {0,1} binary constraints: NP-hard
Similar problems in MIMO/multiuser detection with binary alphabet
(Near-) optimal sphere decoding algorithm (SDA)
Incurs polynomial (cubic) complexity in general
Same technique detects κ(s) given θ(s) and ρ(s)
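The sphere decoder itself is beyond a slide-sized sketch, but for small k the same binary least-squares detection can be done by exhaustive enumeration; this hypothetical stand-in again assumes per-layer constant levels θ_l:

```python
import numpy as np
from itertools import product

def detect_row(z_i, theta, kappa, lam):
    # enumerate all 2^k binary membership vectors for one row of Z and keep
    # the minimizer of the penalized quadratic cost; feasible for small k
    # (the sphere decoder performs this search far more efficiently)
    k = kappa.shape[1]
    best, best_rho = np.inf, None
    for bits in product([0, 1], repeat=k):
        rho_i = np.array(bits, dtype=float)
        fit = z_i - (kappa * theta) @ rho_i   # sum_l rho_il * theta_l * kappa_:l
        cost = (fit ** 2).sum() + lam * rho_i.sum()
        if cost < best:
            best, best_rho = cost, rho_i
    return best_rho

# row generated by layer 1 only (level 2) should be detected as [1, 0]
kappa = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = np.array([2.0, 3.0])
z = np.array([2.0, 2.0, 0.0])
```

Note also how the L1 penalty enters as the linear term lam * rho_i.sum(), as the non-negativity remark above indicates.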
Convergence and Implementation
Data fitting error cost: bounded below and non-increasing per iteration
SOC algorithm converges (at least) to a stationary point
Initialization
Background level fitting to obtain matrix Z (recall the submatrix parameter fitting)
Parameter choices
Number of layers k : explain a certain percentage of variation
Pruning steps [Lazzeroni et al ’02], [Turner et al ’05]
Membership indicators ρ(0) and κ(0): k-means [Turner et al ’05]
Sparsity regularization parameter λ: trial-and-error/bi-cross-validation [Witten et al ’09]
Uniqueness
Plain plaid models: decomposition Z = R D K^T into a product of unknown matrices
Binary-valued matrices: R = [ρ_il] (n × k) and K = [κ_jl] (p × k)
Diagonal matrix D = diag(μ_1, ..., μ_k)
Blind source separation (BSS) [Talwar et al ’96], [van der Veen et al ’96]
Product of two matrices with finite alphabet (FA)/constant modulus (CM) constraints: (generally) uniquely identifiable with a sufficient number of samples
3-way array (Candecomp/Parafac) [Kruskal ’77], [Sidiropoulos et al ’00]: unique up to permutation and scaling
Fails to hold in the two-way case (third dimension h = 1)
Uniqueness (cont’d): Sparsity in blind identification
Sparse component analysis: very sparse representation [Georgiev et al ’07]
Non-negative source separation using local dominance [Chan et al ’08]
Proposition: Consider Z = R D K^T, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank. Each column vector k_l of K is locally sparse for all l,
which means there exists an (unknown) row index j_l such that row j_l of K belongs to layer l only.
Then, given Z, the matrices R, D, and K are unique up to permutations.
Proof relies on convex analysis: the affine hull of the column vectors of K coincides with that of Z
Under local sparseness, its convex hull becomes the intersection of the affine hull and the positive orthant
Columns of K are extreme points of the convex hull
Results hold also when R is locally sparse (symmetry)
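The local sparseness condition is easy to check for a given binary membership matrix; a small sketch (the example matrices K_good and K_bad are illustrative):

```python
import numpy as np

def locally_sparse(K):
    # K: p x k binary membership matrix, columns = layers. Every layer l must
    # have some row j_l whose membership selects layer l only (row sum = 1).
    k = K.shape[1]
    return all(np.any((K[:, l] == 1) & (K.sum(axis=1) == 1)) for l in range(k))

K_good = np.array([[1, 0], [1, 1], [0, 1]])   # rows 0 and 2 are exclusive
K_bad  = np.array([[1, 1], [1, 1], [0, 1]])   # layer 0 has no exclusive row
```

By the proposition, a decomposition whose K (or, by symmetry, R) passes this check is unique up to permutations.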
Preliminary Simulation
Two uniform blocks + noise ~ Unif[0, 0.5]
SOC parameters: k = 2, S = 20, T = 1, and λ = 0, 3
(Figure: Original and Permuted data, and recovered co-clusters for λ = 0, λ = 3, and the Plaid method.)
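A data set like the one on this slide can be generated as follows (block sizes, positions, and levels are assumptions; the slide does not state them):

```python
import numpy as np

# two uniform blocks plus Unif[0, 0.5] noise, then a random row/column
# permutation: the "Permuted" panel is what the algorithm actually sees
rng = np.random.default_rng(0)
Y = rng.uniform(0.0, 0.5, size=(20, 20))   # background noise
Y[:8, :8] += 3.0                           # block 1 (size/level assumed)
Y[12:, 12:] += 5.0                         # block 2 (size/level assumed)
perm_r, perm_c = rng.permutation(20), rng.permutation(20)
Y_perm = Y[perm_r][:, perm_c]
```

Permuting rows and columns hides the block structure from the eye without changing it, so a co-clustering algorithm must recover the blocks from Y_perm.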
Real Data Tests
Internet traffic flow data
Uncover different types of co-clusters: in-star, out-star, bi-mesh, ...
Email application examples: department servers, Gmail
Overlapping co-clusters may reveal server farms
Gene expression microarray data
Co-clusters may exhibit some biological patterns
Need to check with the gene enrichment value
Concluding Summary
Plaid models to reveal overlapping co-clusters
Exploit sparsity for parsimonious recovery
Jointly decide among multiple layers
SOC algorithm iteratively updates the unknown parameters
Coordinate descent solver for the layer-level parameters
Sphere decoder detects membership indicators jointly
Local sparseness leads to unique decomposition
Future Directions
Implementation issues with parameter choices
Efficient initializations and membership vector detection
Comprehensive numerical experiments on real data
Thank You!