
1

Computational Biology Ana Teresa Freitas

2015/2016

Microarray data analysis

DNA polymer: the double helix

2

Application example: gene expression analysis to identify genes involved in disease

Expression analysis by microarray a, One-color expression analysis uses a single fluorescent label (green wavy lines) and two chips to generate expression profiles for two or more cell samples. b, Two-color expression analysis uses two different fluorescent labels (green and red wavy lines) and a single chip to generate expression profiles for two different cell samples.

Gene expression: process by which genetic information at the DNA level is converted into functional proteins

DNA chips

Solid flat surface containing probe DNA molecules in a matrix-like pattern (array), to detect complementary DNA strands

DNA probe, labeled DNA target

Biological Sample

Biological Information

• Gene Expression

• Disease Diagnosis

• Drug Discovery

• Pharmacology

3

Microarrays

•  Rows represent genes

•  Columns represent samples

•  Many problems may be solved using clustering

•  Example of microarray dataset

Figure: wafer, chip, feature, final chip.

4

DNA chips - detection

Incyte, www.incyte.com

• Target DNA is labeled with fluorescent molecules.

• Detection of hybridization by laser scanning and imaging.

Applications Development/gene expression

Microarrays used to examine patterns of gene expression in different human tissues

5

Applications Human disease

Study the genetic basis of disease by comparing gene expression patterns from normal individuals and patients; look at genetic predisposition, onset, and regression of disease. Diseases studied using microarrays: cancer (>80%), diabetes, cardiovascular disease, Alzheimer's, stroke, AIDS, cystic fibrosis, Parkinson's, autism, anemia.

Applications Genetic screening and diagnostics

Screen for disease-causing sequence variants so that individuals can check their genetic material for treatable and curable genetic diseases.

6

Applications Drug discovery

Gene expression profiles in patients undergoing drug treatment; patient genotyping to check for drug-response profiles.

DNA Chips: the experiment


7

Microarray data

Gi = expression levels of gene i across samples.

Sj = expression levels of all genes for one sample j.

Typical examples of samples: heat shock, phases in the cell cycle, cancer, normal, ...

Microarray data

Rows: genes. Columns: mRNA samples.

Entry (i,j): gene expression level of gene i in mRNA sample j.

Each entry = log(treated expression value / control expression value)

         sample1  sample2  sample3  sample4  sample5  ...
gene 1     0.46     0.30     0.80     1.51     0.90   ...
gene 2    -0.10     0.49     0.24     0.06     0.46   ...
gene 3     0.15     0.74     0.04     0.10     0.20   ...
gene 4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
gene 5    -0.06     1.06     1.35     1.09    -1.09   ...
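A minimal sketch of how such log-ratio entries can be computed with NumPy; the intensity values below are made up, and base-2 logarithms are assumed since the slides do not state the base:

```python
import numpy as np

# Hypothetical raw intensities: rows = genes, columns = samples.
treated = np.array([[120.0,  95.0, 210.0],
                    [ 80.0, 130.0,  60.0]])
control = np.array([[ 60.0, 100.0,  70.0],
                    [ 90.0, 110.0,  65.0]])

# Each matrix entry is log2(treated / control): positive values indicate
# higher expression in the treated sample, negative values lower expression.
log_ratios = np.log2(treated / control)
print(np.round(log_ratios, 2))
```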

8

What do we actually measure?

•  We measure the signal of cDNA target(s) which hybridize(s) to the probe (and backgrounds, ratios, standard deviations, dust, etc.).

•  What do we wish to know (an abstraction)? [mRNA]_1,a, [mRNA]_1,b, ..., [mRNA]_N,a, [mRNA]_N,b

where N = number of genes, and a and b = the different colors.

Factors with impact on the signal level

•  Amount of mRNA
•  Labeling efficiencies
•  Quality of the RNA
•  Laser/dye combination
•  Detection efficiency of the photomultiplier
•  ...

9

Typical Assumption

[mRNA]_n,a ∝ signal_n,a

"Normalization constant":

[mRNA]_n,a = k · signal_n,a

where n = gene index, a = color

Low level analysis

•  Image analysis: computation of probes' intensities/signals.

•  Normalization: the attempt to compensate for systematic technical differences between chips, in order to see more clearly the systematic biological differences between samples. Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes.

10

Normalization

•  Sources of systematic errors
   •  Different incorporation efficiency of dyes
   •  Different amounts of mRNA
   •  Experimenter/protocol issues (comparing chips processed by different labs)
   •  Different scanning parameters
   •  Batch bias

Normalization

l  Two problems:

l  How to detect biases? Which genes to use for estimating biases among chips?

l  How to remove the biases?

11

Which genes to use for bias detection?

•  All genes on the chip
   •  Assumption: most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (<20%).
   •  Limits:
      •  Not appropriate when comparing highly heterogeneous samples (different tissues).
      •  Not appropriate for analysis of 'dedicated chips' (apoptosis chips, inflammation chips, etc.).

Which genes to use for bias detection?

•  Housekeeping genes
   –  Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples.
      •  Affymetrix novel chips: a 'normalization set' of 100 genes.
      •  NHGRI's cDNA microarrays: a set of 70 "house-keeping" genes.
   –  Limits:
      •  The validity of the assumption is questionable.
      •  Housekeeping genes are usually expressed at high levels, so they are not informative in the low-intensity range.

12

Normalization methods

•  Global normalization (scaling): enforces the chips to have equal mean (median) intensity.

•  Intensity-dependent normalization (lowess): enforces equal means at all intensities.

•  Quantile normalization: enforces the chips to have identical intensity distributions.

Global normalization (Scaling)

•  A single normalization factor k is computed for balancing chips:

   Xi_norm = k · Xi

•  Multiplying intensities by this factor equalizes the mean (median) intensity among the compared chips.
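A minimal sketch of global scaling in NumPy, assuming a made-up target median and made-up intensities (the slides do not prescribe either):

```python
import numpy as np

def global_scale(chip, target_median=1000.0):
    """Scale one chip so its median intensity equals a common target.

    X_norm = k * X, with a single factor k per chip
    (a sketch of global/scaling normalization; the target value is arbitrary).
    """
    k = target_median / np.median(chip)
    return k * chip

# Hypothetical intensities from two chips to be balanced.
chip1 = np.array([500.0, 900.0, 1200.0, 2500.0])
chip2 = np.array([800.0, 1500.0, 2100.0, 4000.0])

# After scaling, both chips have the same median intensity.
print(np.median(global_scale(chip1)), np.median(global_scale(chip2)))
```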

13

Global normalization (Scaling)

Figure: intensity distributions before and after scaling.

Lowess – local linear fit

l  The names "lowess" and "loess" are derived from the term "locally weighted scatter plot smooth," as both methods use locally weighted linear regression to smooth data.

l  Detect Intensity-dependent Biases: M vs A plots l  X axis: A – average intensity

A = 0.5*log(Cy3*Cy5) l  Y axis: M – log ratio

M = log(Cy3/Cy5)

14

We expect the M vs A plot to be centered around M = 0 at all intensities A.

Intensity-dependent bias: at low intensities M < 0 (Cy3 < Cy5), while at high intensities M > 0 (Cy3 > Cy5).

* Global normalization cannot remove intensity-dependent biases

15

Intensity-Dependent Normalization

Lowess: fit a local regression curve c(A) and apply an intensity-dependent correction:

   Xi_norm = k(A) · Xi,   where c(A) = log(k(A))
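A sketch of intensity-dependent normalization along these lines, assuming base-2 logarithms, simulated two-color intensities, and the lowess smoother from statsmodels; the smoothing fraction frac=0.3 is an arbitrary choice, not something given in the slides:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical two-color intensities for one chip (Cy3, Cy5 > 0).
rng = np.random.default_rng(0)
cy3 = rng.lognormal(mean=7, sigma=1, size=1000)
cy5 = cy3 * np.exp(0.1 * np.log(cy3))      # inject an intensity-dependent bias

# M vs A coordinates, as in the slides: A = 0.5*log(Cy3*Cy5), M = log(Cy3/Cy5).
A = 0.5 * np.log2(cy3 * cy5)
M = np.log2(cy3 / cy5)

# Lowess fit c(A): the local regression curve of M on A.
c_of_A = lowess(M, A, frac=0.3, return_sorted=False)

# Intensity-dependent correction: subtract the fitted trend,
# M_norm = M - c(A), which corresponds to X_norm = k(A)*X with c(A) = log(k(A)).
M_norm = M - c_of_A
```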

Quantile Normalization

•  Terry Speed's group introduced a non-parametric procedure that normalizes to a synthetic chip.

•  The method assumes that the distribution of gene abundances is nearly the same in all samples.

16

Quantile Normalization

•  Sort each column of the data matrix according to the genes' (probes') intensities in each chip.
•  Compute the mean intensity at each rank across the chips (the "average chip").
•  Replace each intensity by the mean intensity at its rank.
•  Re-order the columns to their original state, so that each row again corresponds to a gene.

Figure: Chip #1, Chip #2, Chip #3 and the average chip.
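The four steps above can be sketched in a few lines of NumPy (ties in rank are handled naively here, which a production implementation would treat more carefully):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization of a genes x chips matrix (a sketch).

    1. Sort each column (chip) by intensity.
    2. Average the sorted values at each rank across chips ("average chip").
    3. Give every chip the average value at the corresponding rank,
       restoring the original gene order of each column.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each gene within its chip
    mean_at_rank = np.sort(X, axis=0).mean(axis=1)     # average chip, one value per rank
    return mean_at_rank[ranks]

# Hypothetical 4 genes x 3 chips.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.5, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))   # every column now has the same value distribution
```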

What is Cluster Analysis?

•  Cluster: a collection of data objects
   •  Similar to one another within the same cluster
   •  Dissimilar to the objects in other clusters

•  Cluster analysis: grouping a set of data objects into clusters

•  Clustering is unsupervised classification: no predefined classes

•  Typical applications
   •  As a stand-alone tool to get insight into the data distribution
   •  As a preprocessing step for other algorithms

17

Things to study (1)

•  Clustering (grouping) genes: i.e., finding groups of co-regulated genes.

Example: expression levels across time (samples) of two clusters of co-regulated genes.

Things to study (2)

•  Clustering (grouping) samples: i.e., finding groups of samples with similar genetic profiles (e.g., cancer types). Are there groups of similar behaviour?

18

Things to study (3)

•  Classifying genes: i.e., deciding if a gene is co-regulated with some known gene(s), based on their expression profiles across samples.

Figure: expression profiles across samples of annotated gene 1, annotated gene 2 and an unknown gene.

Co-regulation? Similar biological function? Same transcription factor?

Things to study (4)

•  Classifying samples: i.e., classifying new samples based on a set of already classified samples (example: cancer versus normal; different types of cancer; ...).

Figure: classified samples of classes A and B, plus samples to be classified.

19

Things to study (5)

•  Selecting genes:
   a) Deciding if a given gene, in isolation, behaves differently in a control versus experimental situation (e.g., cancer vs normal, two types of cancer, treatment vs non-treatment).
   b) Selecting which group of genes is significantly different in a control versus experimental situation (same examples).
   c) Selecting which group of genes is relevant for a given classification problem.

Clustering methods

•  Similarity-based (need a similarity function)
   •  Construct a partition
   •  Agglomerative, bottom up
   •  Searching for an optimal partition
   •  Typically "hard" clustering

•  Model-based (latent models, probabilistic or algebraic)
   •  First compute the model
   •  Clusters are obtained easily after having a model
   •  Typically "soft" clustering

20

Similarity-based clustering

•  Define a similarity function to measure the similarity between two objects.

•  Common criteria: find a partition that
   •  Maximizes intra-cluster similarity
   •  Minimizes inter-cluster similarity

•  Two ways to construct the partition:
   •  Hierarchical (e.g., agglomerative hierarchical clustering)
   •  Search starting from a random partition (e.g., K-means)

Partitioning Algorithms: Basic Concept

•  Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

•  Given k, find the partition into k clusters that optimizes the chosen partitioning criterion.
   •  Global optimum: exhaustively enumerate all partitions.
   •  Heuristic methods: k-means and k-medoids algorithms.
   •  k-means (MacQueen'67): each cluster is represented by the center of the cluster.
   •  k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster.

21

K-means

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS):

   arg min_S  Σ_{i=1..k}  Σ_{x ∈ Si}  ||x − µi||²

where µi is the mean of the points in Si.

The K-Means Clustering Method

•  Given k, the k-means algorithm is implemented in four steps:
   •  Step 1: Partition the objects into k non-empty subsets.
   •  Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
   •  Step 3: Assign each object to the cluster with the nearest seed point.
   •  Step 4: Go back to Step 2; stop when there are no more new assignments.

22

The K-Means Clustering Method

•  Example (K = 2):
   1. Arbitrarily choose K objects as the initial cluster centers.
   2. Assign each object to the most similar center.
   3. Update the cluster means.
   4. Reassign the objects and update the cluster means again; repeat until no object changes cluster.

The K-Means Clustering Method

The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation.

Given an initial set of k means m1(1), ..., mk(1), the algorithm proceeds by alternating between two steps.

Assignment step: assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean:

   Si(t) = { xp : ||xp − mi(t)||² ≤ ||xp − mj(t)||²  for all 1 ≤ j ≤ k }

where each xp is assigned to exactly one Si(t), even if it could be assigned to two or more of them.

23

The K-Means Clustering Method

Update step: calculate the new means as the centroids of the observations in the new clusters:

   mi(t+1) = (1 / |Si(t)|) · Σ_{xj ∈ Si(t)} xj

Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.

The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective and there are only finitely many such partitionings, the algorithm must converge to a local optimum. There is no guarantee that the global optimum is found using this algorithm.
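A minimal NumPy sketch of Lloyd's algorithm as described above; the initialization, stopping rule and the toy data are illustrative choices, not taken from the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation goes to the nearest center
        # (least squared Euclidean distance, i.e. least contribution to the WCSS).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # converged: assignments no longer change
        labels = new_labels
        # Update step: each center becomes the mean (centroid) of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical usage: 5 genes measured in 2 samples, grouped into k = 2 clusters.
genes = np.array([[0.1, 0.3], [0.2, 0.1], [3.0, 3.2], [2.8, 3.5], [3.1, 2.9]])
labels, centers = kmeans(genes, k=2)
print(labels, centers, sep="\n")
```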

Consider the following expression matrix, where the expression levels of 5 genes (G1 to G5) were analyzed in 2 patients (P1 and P2). Use the Euclidean distance as the distance metric. Determine the groups found by the K-means (K=2) algorithm when it is used to cluster the genes and the centroids are initialized with (0,0) and (1,2). In each iteration of the algorithm, present the centroids and the genes in each group (cluster).


24

The K-Means Clustering Method

The running time of Lloyd's algorithm is often given as O(n·k·d·i), where n is the number of d-dimensional vectors, k the number of clusters, and i the number of iterations needed until convergence. On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice.

Comments on the K-Means Method

•  Strength: relatively efficient: O(t·k·n), where n is the number of objects, k the number of clusters and t the number of iterations. Normally k, t << n.

•  For comparison: PAM: O(k(n−k)²), CLARA: O(k·s² + k(n−k)).

•  Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

•  Weaknesses
   •  Applicable only when a mean is defined; what about categorical data?
   •  Need to specify k, the number of clusters, in advance.
   •  Unable to handle noisy data and outliers.
   •  Not suitable for discovering clusters with non-convex shapes.

25

Variations of the K-Means Method

•  A few variants of k-means exist, which differ in:

l  Selection of the initial k means

l  Dissimilarity calculations

l  Strategies to calculate cluster means

l  Handling categorical data: k-modes (Huang’98)

l  Replacing means of clusters with modes

l  Using new dissimilarity measures to deal with categorical objects

l  Using a frequency-based method to update modes of clusters

l  A mixture of categorical and numerical data: k-prototype method

What is the problem of k-Means Method?

•  The k-means algorithm is sensitive to outliers!
   •  An object with an extremely large value may substantially distort the distribution of the data.

•  K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

26

Agglomerative Hierarchical Clustering

•  Given a similarity function to measure the similarity between two objects,

•  gradually group similar objects together in a bottom-up fashion,

•  and stop when some stopping criterion is met.

•  Variations: different ways to compute group similarity based on individual object similarities.

Distance Metrics

•  For clustering algorithms, the calculation of a distance between gene vectors or experiment vectors is a necessary step.

•  Distance metrics can be classified as:
   •  Metric distances
   •  Semi-metric distances

•  Metric distances satisfy:
   1.  d_ab >= 0
   2.  d_ab = d_ba
   3.  d_aa = 0
   4.  d_ab <= d_ac + d_cb

•  Semi-metric distances obey 1) to 3) but fail 4).

27

Distance Metrics

Minkowski distance:

   d(i,j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)

If q = 1, d is the Manhattan distance:

   d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

If q = 2, d is the Euclidean distance:

   d(i,j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )

Both are metric distances.
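A small sketch of the Minkowski distance in Python; the example profiles reuse values from the expression table shown earlier, and q = 1 and q = 2 recover the Manhattan and Euclidean distances:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance d(i,j) = (sum_f |x_f - y_f|^q)^(1/q) (a sketch)."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

# Two gene expression profiles across 4 samples (values from the earlier table).
gi = np.array([0.46, 0.30, 0.80, 1.51])
gj = np.array([-0.10, 0.49, 0.24, 0.06])

manhattan = minkowski(gi, gj, q=1)   # q = 1
euclidean = minkowski(gi, gj, q=2)   # q = 2
print(manhattan, euclidean)
```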

In probability theory and statistics, correlation, also called correlation coefficient, indicates the strength and direction of a linear relationship between two random variables.

In general statistical usage, correlation or co-relation refers to the departure of two variables from independence, although correlation does not imply causation.

The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations.

Correlation (cross, auto): ρ_xy measures the tendency of x(t) and y(t) to covary; it is the normalized cross-covariance γ(Δ).

28

Estimating the population standard deviation from the sample standard deviation (estimated by examining a random sample taken from the population).
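A sketch of the Pearson correlation coefficient computed exactly as described above (covariance divided by the product of the standard deviations); using 1 − r as a dissimilarity between expression profiles is a common convention rather than something prescribed in the slides:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Two gene expression profiles (values from the earlier table).
gi = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
gj = np.array([0.15, 0.74, 0.04, 0.10, 0.20])

r = pearson(gi, gj)
dist = 1.0 - r        # a common correlation-based dissimilarity
print(r, dist)
```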

Distance Metrics

Entropy-based distances: Mutual Information (semi-metric distance)

•  Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.

•  MI is a measure of the additional information known about one expression pattern when given another.

•  MI is not based on linear models and can therefore also detect non-linear dependencies.
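A sketch of a binned mutual-information estimate between two expression profiles; the number of bins and the use of natural logarithms are arbitrary choices:

```python
import numpy as np

def mutual_information(x, y, bins=5):
    """Mutual information between two profiles (a sketch).

    Discretizes both profiles into bins and computes
    MI = sum_ij p(i,j) * log( p(i,j) / (p(i) * p(j)) ).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# MI captures non-linear dependence: y = x^2 on a symmetric range has
# (near) zero Pearson correlation but clearly non-zero mutual information.
x = np.linspace(-1, 1, 200)
y = x ** 2
print(mutual_information(x, y))
```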

29

Similarity-induced Structure

How to Compute Group Similarity?

Three popular methods, given two groups g1 and g2:

•  Single-link algorithm: s(g1,g2) = similarity of the closest pair.
•  Complete-link algorithm: s(g1,g2) = similarity of the farthest pair.
•  Average-link algorithm: s(g1,g2) = average similarity over all pairs.

30

Comparison of the Three Methods

•  Single-link
   •  "Loose" clusters
   •  Individual decision, sensitive to outliers

•  Complete-link
   •  "Tight" clusters
   •  Individual decision, sensitive to outliers

•  Average-link
   •  "In between"
   •  Group decision, insensitive to outliers

•  Which one is the best? It depends on what you need!

Hierarchical (agglomerative) clustering

Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram (a tree whose leaves are the objects, e.g. ordered 2 3 5 1 4, and whose height axis is dissimilarity).

Cutting the dendrogram at a certain level yields clusters.

Cutting the dendrogram is a problem analogous to the selection of K in K-means clustering.
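A sketch of average-link agglomerative clustering and dendrogram cutting using SciPy; the expression matrix reuses values from the earlier example table, and the choice of a correlation-based dissimilarity and of two flat clusters is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Expression matrix: 5 genes (rows) x 4 samples (columns), from the earlier table.
X = np.array([[ 0.46,  0.30,  0.80,  1.51],
              [-0.10,  0.49,  0.24,  0.06],
              [ 0.15,  0.74,  0.04,  0.10],
              [-0.45, -1.03, -0.79, -0.56],
              [-0.06,  1.06,  1.35,  1.09]])

# Pairwise dissimilarities between genes; 'correlation' gives 1 - Pearson r.
d = pdist(X, metric='correlation')

# Average-link agglomerative clustering ('single' and 'complete' are the
# other group-similarity rules discussed above).
Z = linkage(d, method='average')

# "Cutting the dendrogram": here into 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```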

31

Microarray data from a time course of serum stimulation of primary human fibroblasts.

Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 hr. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr and 24 hr.

Clustering: correlation coefficient + agglomerative clustering (average-link).

Clusters with biological interpretation: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signalling and angiogenesis, (E) wound healing and tissue remodelling.

Example of agglomerative gene clustering (Eisen et al., 1998).

Data Structures

•  Data matrix (n objects × p variables):

      x11  ...  x1f  ...  x1p
      ...       ...       ...
      xi1  ...  xif  ...  xip
      ...       ...       ...
      xn1  ...  xnf  ...  xnp

•  Dissimilarity matrix (n × n):

      0
      d(2,1)   0
      d(3,1)   d(3,2)   0
      :        :        :
      d(n,1)   d(n,2)   ...   0

32

Biclustering: Motivation

•  Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
   •  the gene dimension
   •  the condition dimension

•  This corresponds to:
   •  analysis of the expression patterns of genes, by comparing rows in the matrix;
   •  analysis of the expression patterns of samples, by comparing columns in the matrix.

Biclustering: Motivation

•  Common objectives pursued when analyzing gene expression data include:
   1.  Grouping of genes according to their expression under multiple conditions.
   2.  Classification of a new gene, given its expression and the expression of other genes with known classification.
   3.  Grouping of conditions based on the expression of a number of genes.
   4.  Classification of a new sample, given the expression of the genes under that experimental condition.

33

Biclustering: Motivation

•  A large number of clustering approaches have been proposed for the analysis of gene expression data.

•  Clustering can be used to group either genes or conditions, and so pursues objectives 1 and 3 directly and objectives 2 and 4 indirectly.

•  The results of applying standard clustering techniques to genes are limited by the existence of a number of experimental conditions under which the activity of genes is uncorrelated.

Motivation

•  A similar limitation exists when clustering of conditions is performed.

•  Many activation patterns are common to a group of genes only under specific experimental conditions.

•  Discovering such local expression patterns may be the key to uncovering many genetic pathways that are not apparent otherwise.

•  It is therefore highly desirable to move beyond the clustering paradigm and develop approaches capable of discovering local patterns in microarray data (Ben-Dor et al., 2002).

34

What is Biclustering?

•  Biclustering = simultaneous clustering of both rows and columns of a data matrix.

•  The concept can be traced back to the 1970s (Hartigan, 1972), although it has rarely been used or studied.

•  The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.

•  The technique is used in other fields, such as collaborative filtering, information retrieval and data mining.

What is Biclustering?

•  We consider an n by m data matrix, A = (X,Y), where
   •  X = {x1, ..., xn} = set of n rows
   •  Y = {y1, ..., ym} = set of m columns
   •  aij = numeric value (discrete or real) representing the relation between row i and column j.

•  In the case of gene expression matrices:
   •  X = set of genes
   •  Y = set of conditions
   •  aij = expression level of gene i under condition j (a real value).

35

What is Biclustering?

Gene expression matrix A = (X,Y):

             Condition 1  ...  Condition j  ...  Condition m
   Gene 1    a11          ...  a1j          ...  a1m
   ...       ...          ...  ...          ...  ...
   Gene i    ai1          ...  aij          ...  aim
   ...       ...          ...  ...          ...  ...
   Gene n    an1          ...  anj          ...  anm

What is Biclustering?

Given the matrix A = (X,Y), let I be a subset of rows and J a subset of columns.

•  (I,Y) = a subset of rows that exhibit similar behavior across the set of all columns = a cluster of rows.

•  (X,J) = a subset of columns that exhibit similar behavior across the set of all rows = a cluster of columns.

36

What is Biclustering?

•  (I,J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice-versa
   = the sub-matrix of A that contains only the elements aij with rows in I and columns in J
   = a bicluster.

•  We want to identify a set of biclusters Bk = (Ik, Jk).

•  Each bicluster Bk must satisfy some specific homogeneity characteristics.

Example: a 6 × 10 gene expression matrix with entries aij, genes G1 to G6 in the rows and conditions C1 to C10 in the columns.

X = {G1, G2, G3, G4, G5, G6}
Y = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10}

I = {G2, G3, G4}
J = {C4, C5, C6}

Bicluster (I,J) = ({G2, G3, G4}, {C4, C5, C6})
Cluster of rows (I,Y): I = {G2, G3, G4}
Cluster of columns (X,J): J = {C4, C5, C6}


37

What is Biclustering?

•  Biclustering goals
   •  Perform simultaneous clustering on the row and column dimensions of the gene expression matrix, instead of clustering the rows and columns separately.
   •  Identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.

•  Gene expression data analysis
   •  Identify subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities for every condition.

Bicluster Types

•  An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.

•  There are four major classes of biclusters:
   1.  Biclusters with constant values.
   2.  Biclusters with constant values on rows or columns.
   3.  Biclusters with coherent values.
   4.  Biclusters with coherent evolutions.

38

Constant Values

   1.0 1.0 1.0 1.0
   1.0 1.0 1.0 1.0
   1.0 1.0 1.0 1.0
   1.0 1.0 1.0 1.0

Constant Values on Rows or Columns

   Constant rows:       Constant columns:
   4.0 4.0 4.0 4.0      4.0 3.0 2.0 1.0
   3.0 3.0 3.0 3.0      4.0 3.0 2.0 1.0
   2.0 2.0 2.0 2.0      4.0 3.0 2.0 1.0
   1.0 1.0 1.0 1.0      4.0 3.0 2.0 1.0

39

Coherent Values

   Additive model:      Multiplicative model:
   4.0 9.0 6.0 5.0      4.5 1.5 6.0 3.0
   3.0 8.0 5.0 4.0      6.0 2.0 8.0 4.0
   1.0 6.0 3.0 2.0      3.0 1.0 4.0 2.0
   0.0 5.0 2.0 1.0      1.5 0.5 2.0 1.0
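A sketch of a homogeneity check for the additive model: the mean squared residue (the score used by Cheng and Church, 2000) is zero for a perfectly additive bicluster such as the example above:

```python
import numpy as np

def additive_residue(B):
    """Mean squared residue of a candidate bicluster under the additive model.

    For a perfectly coherent (additive) bicluster, every entry equals
    row mean + column mean - overall mean, so the residue is 0.
    """
    B = np.asarray(B, float)
    expected = B.mean(axis=1, keepdims=True) + B.mean(axis=0, keepdims=True) - B.mean()
    return float(np.mean((B - expected) ** 2))

# The additive-model example above has residue 0; a random matrix does not.
additive = np.array([[4.0, 9.0, 6.0, 5.0],
                     [3.0, 8.0, 5.0, 4.0],
                     [1.0, 6.0, 3.0, 2.0],
                     [0.0, 5.0, 2.0, 1.0]])
print(additive_residue(additive))                                   # 0.0
print(additive_residue(np.random.default_rng(0).random((4, 4))))    # > 0
```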

Coherent Evolutions

   Overall coherent evolution:    Coherent evolution on the rows:
   S1 S1 S1 S1                    S4 S4 S4 S4
   S1 S1 S1 S1                    S3 S3 S3 S3
   S1 S1 S1 S1                    S2 S2 S2 S2
   S1 S1 S1 S1                    S1 S1 S1 S1

40

Coherent Evolutions

   Coherent evolution on the columns:    Order-preserving sub-matrix (OPSM):
   S4 S3 S2 S1                           12 20 15 90
   S4 S3 S2 S1                           15 27 20 40
   S4 S3 S2 S1                           35 49 40 49
   S4 S3 S2 S1                           10 19 13 70

Algorithms

•  In the simplest case, when the data matrix is binary, a bicluster corresponds to a biclique in the corresponding bipartite graph.

•  Finding a maximum-size bicluster is then equivalent to finding the maximum edge biclique in a bipartite graph.
   •  This problem is known to be NP-complete (Peeters, 2003).

•  More complex cases, where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster, have a complexity that is necessarily no lower than this simpler case.

41

Algorithms

•  Given this, the large majority of the algorithms use heuristic approaches to identify biclusters.

•  In many cases the algorithm is preceded by a normalization step applied to the data matrix.
   •  The goal is to make the patterns of interest more evident.

•  Some algorithms avoid heuristics but exhibit an exponential worst-case runtime.

Algorithms

•  Different objectives:
   •  Identify one bicluster.
   •  Identify a given number of biclusters.

•  Different approaches:
   •  Discover one bicluster at a time.
   •  Discover one set of biclusters at a time.
   •  Discover all biclusters at the same time (simultaneous bicluster identification).

42

Algorithms: Heuristic Approaches

•  Iterative row and column clustering combination
   •  Apply clustering algorithms to the rows and columns of the data matrix, separately.
   •  Combine the results using some sort of iterative procedure to combine the two cluster arrangements.

•  Divide and conquer
   •  Break the problem into several subproblems that are similar to the original problem but smaller in size.
   •  Solve the subproblems recursively.

Algorithms: Heuristic Approaches

   •  Combine the intermediate solutions to create a solution to the original problem.
   •  Usually break the matrix into sub-matrices (biclusters) based on a certain criterion and then continue the biclustering process on the new sub-matrices.

•  Greedy iterative search
   •  Always make a locally optimal choice in the hope that this choice will lead to a globally good solution.
   •  Usually perform greedy row/column addition/removal.

43

Algorithms

•  Exhaustive bicluster enumeration
   •  A number of methods have been used to speed up exhaustive search.
   •  In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.

Biclustering Applications

•  Biological applications: microarray data analysis
   •  Yeast Saccharomyces cerevisiae cell cycle: 2884 genes and 17 conditions.
   •  Human B-cells: 4026 genes and 96 conditions.
   •  Acute leukemia patients: 6817 human genes and 72 samples.
   •  Colon tumor: 62 samples and 6500 human genes.
   •  Multiple sclerosis patients: 4132 genes and 48 samples.
   •  Breast tumor: 3226 genes under 22 experimental conditions.