
Page 1:

An Unsupervised Learning Approach for Overlapping Co-clustering

Machine Learning Project Presentation
Rohit Gupta and Varun Chandola
{rohit,chandola}@cs.umn.edu

Page 2:

Outline

• Introduction to Clustering
• Description of Application Domain
• From Traditional Clustering to Overlapping Co-clustering
• Current State of Art
• A Frequent Itemsets Based Solution
• An Alternate Minimization Based Solution
• Application to Gene Expression Data
• Experimental Results
• Conclusions and Future Directions

Page 3:

Clustering

• Clustering is an unsupervised machine learning technique
  - Uses unlabeled samples
• In the simplest form – determine groups (clusters) of data objects such that the objects in one cluster are similar to each other and dissimilar to objects in other clusters
  - Each data object is a set of attributes (or features) with a definite notion of proximity
• Most traditional clustering algorithms
  - Are partitional in nature: they assign a data object to exactly one cluster
  - Perform clustering along one dimension

Page 4:

Application Domains

• Gene Expression Data (Genes vs. Experimental Conditions)
  - Find similar genes based on their expression values for different experimental conditions
  - Each cluster would represent a potential functional module in the organism
• Text Documents Data (Documents vs. Words)
• Movie Recommendation Systems (Users vs. Movies)

Page 5:

Overlapping Clustering

• Also known as soft clustering or fuzzy clustering
• A data object can be assigned to more than one cluster
• Motivation: many real-world data sets have inherently overlapping clusters
  - A gene can be a part of multiple functional modules (clusters)

Page 6:

Co-clustering

• Co-clustering is the problem of simultaneously clustering the rows and columns of a data matrix
  - Also known as bi-clustering, subspace clustering, bi-dimensional clustering, simultaneous clustering, block clustering
• The resulting clusters are blocks in the input data matrix
• These blocks often represent more coherent and meaningful clusters
  - Only a subset of genes participate in any cellular process of interest, and that process is active for only a subset of conditions

Page 7:

Overlapping Co-clustering

overlaps Co-clusters

[Bergmann et al, 2003]

Segal et al, 2003, Banerjee et al, 2005 Dhillon et al, 2003, Cho et al 2004, Banerjee et al, 2005

Overlapping Co-clusters

Page 8:

Current State of Art

• Traditional Clustering – numerous algorithms like k-means
• Overlapping Clustering – probabilistic relational model based approach by Segal et al and Banerjee et al
• Co-clustering – Dhillon et al for gene expression data and document clustering (Banerjee et al provided a general framework using a general class of Bregman distortion functions)
• Overlapping Co-clustering – Iterative Signature Algorithm (ISA) by Bergmann et al for gene expression data
  - Uses an alternate minimization technique
  - Involves thresholding after every iteration

We propose a more formal framework based on the co-clustering approach by Dhillon et al, and another, simpler frequent-itemsets-based solution.

Page 9:

Frequent Itemsets Based Approach

• Based on the concept of frequent itemsets from the association analysis domain
  - A frequent itemset is a set of items (features) that occur together more than a specified number of times (referred to as the support threshold) in the data set (see the sketch below)
• The data has to be binary (only presence or absence is considered)
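To make the frequent-itemset idea concrete, here is a minimal sketch (not the code used in this project) of a level-wise, Apriori-style search over binary transaction data with a support threshold; the toy transactions and the threshold value are illustrative.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets whose support
    (number of transactions containing them) is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Start with frequent 1-itemsets, then grow by one item per level.
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {fs: support(fs) for fs in current}
    k = 2
    while current:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        result.update({c: support(c) for c in current})
        k += 1
    return result

# Toy example: each transaction lists the "expressed" items (e.g. gene ids).
transactions = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3"}]
for itemset, sup in frequent_itemsets(transactions, min_support=3).items():
    print(sorted(itemset), sup)
```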

Page 10:

Frequent Itemsets Based Approach (2)

Application to gene expression data:

• Normalization – first along columns (conditions) to remove scaling effects, then along rows (genes)
• Binarization:
  - Values above a preset threshold λ are set to 1 and the rest to 0
  - Values above a preset percentile are set to 1 and the rest to 0
  - Split each gene column into three components g+, g0 and g-, signifying the up- and down-regulation of the gene's expression; this triples the number of items (or genes)
• The gene expression matrix is converted to transaction-format data – each experiment is a transaction and contains the index values of the genes that were expressed in that experiment (illustrated in the sketch below)
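As a rough illustration of this preprocessing (the toy matrix, the threshold value λ = 0.5, and the use of z-score normalization are assumptions; the slides do not specify the exact normalization), the sketch below normalizes a small genes x conditions matrix, binarizes it with a threshold, and converts it to transaction format with one transaction per experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))             # toy matrix: 6 genes x 4 conditions

# Normalize: first along columns (conditions), then along rows (genes).
A = (A - A.mean(axis=0)) / A.std(axis=0)
A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)

# Binarize: values above a preset threshold lambda become 1, the rest 0.
lam = 0.5                               # illustrative threshold
B = (A > lam).astype(int)               # genes x conditions, 0/1

# Transaction format: one transaction per experiment (condition),
# listing the indices of the genes expressed in that experiment.
transactions = [set(np.flatnonzero(B[:, j])) for j in range(B.shape[1])]
print(transactions)
```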

Page 11:

Frequent Itemsets Based Approach (3)

Algorithm:

• Run a closed frequent itemset algorithm to generate frequent closed itemsets with a specified support threshold σ

Post-Processing:

• Prune frequent itemsets (sets of genes) of length < α
• For each remaining itemset, scan the transaction data to record all the transactions (experiments) in which this itemset occurs

(Note: the combination of these transactions (experiments) and the itemset (genes) gives the desired sets of genes together with the subsets of conditions in which they are most tightly co-expressed; see the sketch below.)
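A minimal sketch of this algorithm and its post-processing, reusing the frequent_itemsets helper and the toy transactions from the page 9 sketch; closedness is checked naively here (an itemset is closed if no frequent superset has the same support), and σ and α are illustrative values rather than recommended settings.

```python
def closed_co_clusters(transactions, sigma, alpha):
    """Mine frequent closed itemsets (support >= sigma), prune those of
    length < alpha, and record the supporting transactions for each."""
    transactions = [frozenset(t) for t in transactions]
    freq = frequent_itemsets(transactions, min_support=sigma)   # from the page 9 sketch

    co_clusters = []
    for itemset, sup in freq.items():
        if len(itemset) < alpha:
            continue                                            # prune short itemsets
        # Closed: no frequent proper superset has the same support.
        if any(itemset < other and freq[other] == sup for other in freq):
            continue
        # Scan the transactions to record the experiments containing this itemset.
        experiments = [idx for idx, t in enumerate(transactions) if itemset <= t]
        co_clusters.append((sorted(itemset), experiments))       # (genes, conditions)
    return co_clusters

# Toy usage with the transactions from the page 9 sketch:
print(closed_co_clusters(transactions, sigma=3, alpha=2))
```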

Page 12:

Limitations of Frequent Itemsets Based Approach

• Binarization of the gene expression matrix may lose some of the patterns in the data
• Up-regulation and down-regulation of genes are not directly taken into account
• Setting the right support threshold while incorporating domain knowledge is not trivial
• A large number of modules is obtained – difficult to evaluate biologically
• Traditional association analysis based approaches only consider dense blocks, so noise may break up an actual module – Error Tolerant Itemsets (ETIs) offer a potential solution, though

Page 13:

Alternate Minimization (AM) Based Approach

• Extends the non-overlapping co-clustering approach of [Dhillon et al, 2003; Banerjee et al, 2005]

• Algorithm
  - Input: data matrix A (size m x n) and k, l (number of row and column clusters)
  - Initialize the row and column cluster mappings X (size m x k) and Y (size n x l)
    - Random assignment of rows (or columns) to row (or column) clusters, or
    - Any traditional one-dimensional clustering can be used to initialize X and Y
  - Objective function: ||A − Â||², where Â is a matrix approximation of A computed in one of two ways (both are sketched below):
    - Each element of a co-cluster (obtained using the current X and Y) is replaced by the co-cluster mean a_{I,J}
    - Each element of a co-cluster is replaced by a_{i,J} + a_{I,j} − a_{I,J}, i.e. row mean + column mean − co-cluster mean
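The following sketch shows how the objective ||A − Â||² could be computed for a hard (non-overlapping) assignment, under both approximation schemes listed above; the hard-label representation, the toy data, and the handling of empty clusters are assumptions made for illustration.

```python
import numpy as np

def coclustering_objective(A, row_labels, col_labels, k, l, scheme=2):
    """Squared-error objective ||A - A_hat||^2 for a hard co-clustering.
    row_labels (length m) and col_labels (length n) give each row's /
    column's cluster index; scheme 1 uses the co-cluster mean, scheme 2
    uses row mean + column mean - co-cluster mean (as on page 13)."""
    A_hat = np.array(A, dtype=float)     # start from A: empty co-clusters add no error
    for I in range(k):
        rows = np.where(row_labels == I)[0]
        for J in range(l):
            cols = np.where(col_labels == J)[0]
            if rows.size == 0 or cols.size == 0:
                continue
            block = A[np.ix_(rows, cols)]
            a_IJ = block.mean()                           # co-cluster mean
            if scheme == 1:
                A_hat[np.ix_(rows, cols)] = a_IJ
            else:
                a_iJ = block.mean(axis=1, keepdims=True)  # row means within the co-cluster
                a_Ij = block.mean(axis=0, keepdims=True)  # column means within the co-cluster
                A_hat[np.ix_(rows, cols)] = a_iJ + a_Ij - a_IJ
    return A_hat, ((A - A_hat) ** 2).sum()

# Toy usage with made-up labels (k = l = 2 on a 6 x 4 matrix):
A = np.random.default_rng(1).normal(size=(6, 4))
row_labels = np.array([0, 0, 1, 1, 0, 1])
col_labels = np.array([0, 1, 0, 1])
A_hat, sse = coclustering_objective(A, row_labels, col_labels, k=2, l=2)
print(sse)
```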

Page 14:

Alternate Minimization (AM) Based Approach (2)

While (not converged)

- Phase 1:
  - Compute row cluster prototypes (based on the current X and matrix A)
  - Compute the Bregman distance d_Φ(r_i, R_r) from each row to each row cluster prototype
  - Compute the probability with which each of the m rows falls into each of the k row clusters
  - Update the row clustering X, keeping the column clustering Y fixed (some thresholding is required here to allow limited overlap)

- Phase 2:
  - Compute column cluster prototypes (based on the current Y and matrix A)
  - Compute the Bregman distance d_Φ(c_j, C_c) from each column to each column cluster prototype
  - Compute the probability with which each of the n columns falls into each of the l column clusters
  - Update the column clustering Y, keeping the row clustering X fixed

- Compute the objective function ||A − Â||²
- Check convergence (one possible form of this loop is sketched below)
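Below is one possible shape of such an alternating loop, written as a sketch rather than as the authors' algorithm: it uses squared Euclidean distance as the Bregman divergence d_Φ, a normalized exponential of the negative distances for the membership probabilities, a probability threshold tau for limited overlap, and a fixed iteration count in place of an explicit convergence test; all of these concrete choices are assumptions.

```python
import numpy as np

def am_overlapping_cocluster(A, k, l, tau=0.4, n_iter=20, seed=0):
    """Sketch of an alternate-minimization loop for overlapping co-clustering.
    X (m x k) and Y (n x l) are boolean membership matrices; a row/column joins
    every cluster whose membership probability exceeds tau (overlap allowed)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = np.eye(k, dtype=bool)[rng.integers(k, size=m)]   # random initial row clusters
    Y = np.eye(l, dtype=bool)[rng.integers(l, size=n)]   # random initial column clusters

    def memberships(points, protos):
        # Squared Euclidean distance plays the role of d_phi here.
        d = ((points[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
        p = np.exp(-(d - d.min(axis=1, keepdims=True)))  # membership probabilities
        p /= p.sum(axis=1, keepdims=True)
        member = p >= tau                                 # thresholded, overlapping assignment
        member[np.arange(len(points)), p.argmax(axis=1)] = True  # keep at least the best cluster
        return member

    for _ in range(n_iter):
        # Phase 1: row prototypes from the current X, then update X with Y fixed.
        R = np.stack([A[X[:, r]].mean(axis=0) if X[:, r].any() else A[rng.integers(m)]
                      for r in range(k)])
        X = memberships(A, R)
        # Phase 2: column prototypes from the current Y, then update Y with X fixed.
        C = np.stack([A[:, Y[:, c]].mean(axis=1) if Y[:, c].any() else A[:, rng.integers(n)]
                      for c in range(l)])
        Y = memberships(A.T, C)
    return X, Y

# Toy usage: X, Y = am_overlapping_cocluster(np.random.default_rng(2).normal(size=(12, 8)), k=3, l=2)
```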

Page 15:

Observations

• Each row or column can be assigned to multiple row or column clusters, respectively, with a probability based on its distance from the respective cluster prototypes. This produces an overlapping co-clustering.
• Maximum number of overlapping co-clusters that could be obtained = k x l
• Initialization of X and Y can be done in multiple ways – two ways are explored in the experiments
• Thresholding to control the percentage of overlap is tricky and requires domain knowledge
• Cluster evaluation is important – both internal and external
  - SSE and entropy of each co-cluster (sketched below)
  - Biological evaluation using GO (Gene Ontology) for results on gene expression data
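For the internal and external evaluation mentioned above, here is a small sketch of two such measures: the SSE of a co-cluster around its mean (internal), and the entropy of external class labels of the rows in a co-cluster (external). Computing entropy over row labels is one common convention, and the labels and indices below are made up.

```python
import numpy as np
from collections import Counter

def cocluster_sse(A, rows, cols):
    """Internal measure: sum of squared deviations of a co-cluster's
    entries from the co-cluster mean."""
    block = A[np.ix_(rows, cols)]
    return ((block - block.mean()) ** 2).sum()

def cocluster_entropy(labels, rows):
    """External measure: entropy (in bits) of the class labels of the
    rows (e.g. genes) in a co-cluster; 0 means a pure cluster."""
    counts = np.array(list(Counter(labels[r] for r in rows).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy usage with hypothetical functional labels:
A = np.random.default_rng(3).normal(size=(6, 4))
labels = ["ribosome", "ribosome", "stress", "stress", "ribosome", "stress"]
print(cocluster_sse(A, rows=[0, 1, 4], cols=[1, 2]))
print(cocluster_entropy(labels, rows=[0, 1, 4]))
```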

Page 16:

Experimental Results (1)

• Frequent Itemsets Based Approach
  - A synthetic data set (40 x 40)
  - Total number of co-clusters detected = 3

Page 17:

Experimental Results (2)

• Frequent Itemsets Based Approach
  - Another synthetic data set (40 x 40)
  - Total number of co-clusters detected = 7
  - All 4 blocks (in the original data set) were detected
  - Post-processing is needed to eliminate the unwanted co-clusters

Page 18:

Experimental Results (3)

• AM Based Approach
  - Synthetic data sets (20 x 20)
  - Finds the co-clusters in each case

Page 19:

Experimental Results (4)

• AM Based Approach on a Gene Expression Dataset
  - Human lymphoma microarray data [described in Cho et al, 2004]: # genes = 854, # conditions = 96
  - k = 5, l = 5, one-dimensional k-means used to initialize X and Y (see the initialization sketch below)
  - Total number of co-clusters = 25

[Figures: input data matrix; objective function vs. iterations]

A preliminary analysis of the 25 co-clusters shows that only one meaningful co-cluster is obtained.
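A sketch of the one-dimensional k-means initialization of X and Y referred to above, using scikit-learn; converting the 1-D cluster labels into one-hot membership matrices follows the X (m x k), Y (n x l) convention from page 13, and the matrix size in the usage line is a placeholder rather than the lymphoma data.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(A, k, l, seed=0):
    """One-dimensional k-means initialization of the row and column
    cluster mappings X (m x k) and Y (n x l)."""
    row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A)
    col_labels = KMeans(n_clusters=l, n_init=10, random_state=seed).fit_predict(A.T)
    X = np.eye(k, dtype=bool)[row_labels]   # one-hot row memberships
    Y = np.eye(l, dtype=bool)[col_labels]   # one-hot column memberships
    return X, Y

# Toy usage on a small random stand-in matrix:
X, Y = kmeans_init(np.random.default_rng(4).normal(size=(20, 10)), k=5, l=5)
```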

Page 20:

Conclusions

• The frequent itemsets based approach is guaranteed to find dense overlapping co-clusters
  - The Error Tolerant Itemset approach offers a potential solution to the problem of noise
• The AM based approach is a formal algorithm to find overlapping co-clusters
  - It simultaneously performs clustering in both dimensions while minimizing a global objective function
  - Results on synthetic data prove the correctness of the algorithm
• Preliminary results on gene expression data show promise and will be further evaluated
  - A key insight is that applying these techniques to gene expression data requires domain knowledge for pre-processing, initialization and thresholding, as well as for post-processing of the co-clusters obtained

Page 21:

References

• [Bergmann et al, 2003] Sven Bergmann, Jan Ihmels and Naama Barkai, Iterative signature algorithm for the analysis of large-scale gene expression data, Phys. Rev. E 67, 031902, 2003
• [Liu et al, 2004] Jinze Liu, Susan Paulsen, Wei Wang, Andrew Nobel and Jan Prins, Mining Approximate Frequent Itemsets from Noisy Data, Proc. IEEE ICDM, pp. 463-466, 2004
• [Cho et al, 2004] H. Cho, I. S. Dhillon, Y. Guan and S. Sra, Minimum Sum-Squared Residue Co-clustering of Gene Expression Data, Proc. SIAM International Conference on Data Mining, pp. 114-125, 2004
• [Dhillon et al, 2003] Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha, Information-Theoretic Co-clustering, Proc. ACM SIGKDD, pp. 89-98, 2003
• [Banerjee et al, 2004] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra S. Modha, A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, Proc. ACM SIGKDD, pp. 509-514, 2004