Metodi Numerici per la Bioinformatica (Numerical Methods for Bioinformatics)
Academic year 2008/2009
Biclustering
Francesco Archetti

Outline
• Motivation
• What is Biclustering?
• Why Biclustering and not just Clustering?
• Bicluster Types
• Algorithms

Motivations
• Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
– The gene dimension
– The condition dimension
• This corresponds to:
– Analysis of expression patterns of genes by comparing rows in the matrix.
– Analysis of expression patterns of samples by comparing columns in the matrix.

Motivations
• Analysis via clustering makes several a priori assumptions that may not be adequate in all circumstances:
– Clustering can be applied to either genes or samples, implicitly directing the analysis to a particular aspect of the system under study (e.g., groups of patients or groups of co-regulated genes).
– Clustering algorithms usually seek a disjoint cover of the set of elements, requiring that no gene or sample belongs to more than one cluster.

Motivations
• The results of applying standard clustering techniques to genes are limited by the existence of a number of experimental conditions under which the activity of the genes is uncorrelated.
• Many activation patterns are common to a group of genes only under specific experimental conditions.
• Discovering such local expression patterns may be the key to uncovering many genetic pathways that are not apparent otherwise.
• It is therefore highly desirable to move beyond the clustering paradigm and develop approaches capable of discovering local patterns in microarray data.

What is Biclustering?
BICLUSTER:
• A submatrix spanned by a set of genes (rows) and a set of samples (columns).
• Given a gene expression matrix, it is possible to characterize the biological phenomena it embodies by a collection of biclusters, each representing a different type of joint behavior of a set of genes in a corresponding set of samples.

What is Biclustering?
• Given the matrix A = (X,Y), where X is the set of rows and Y is the set of columns:
– I = subset of rows
– J = subset of columns
• (I,Y) = a subset of rows that exhibit similar behavior across the set of all columns = cluster of rows
• (X,J) = a subset of columns that exhibit similar behavior across the set of all rows = cluster of columns

What is Biclustering?
Biclustering Goals:
• Find a set of significant biclusters in a matrix: identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.
• Perform simultaneous clustering on the row and column dimensions of the gene expression matrix instead of clustering the rows and columns separately.
• In gene expression data analysis: identify subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition in the subgroup.

Why Biclustering and not just Clustering?
• Clustering
– Can be applied to either the rows or the columns of the data matrix, separately.
– Produces either clusters of rows (subgroups of rows) or clusters of columns (subgroups of columns).
• Biclustering
– Performs simultaneous clustering of both rows and columns of the data matrix.
– Produces biclusters (subgroups of rows and subgroups of columns).
[Figure labels: general models (clustering panel); general models and local models (biclustering panel)]

Why Biclustering and not just Clustering?
Unlike clustering:
• Biclustering identifies groups of genes that show similar activity patterns under a specific subset of the experimental conditions.
Biclustering is the key technique to use when:
• Only a small set of the genes participates in a cellular process of interest.
• An interesting cellular process is active only in a subset of the conditions.
• A single gene may participate in multiple pathways that may or may not be co-active under all conditions.

Biclustering vs. Clustering
[Figure: expression matrix with rows Gene A to Gene M and columns (conditions) 1 to 10. The clustering panel uses all ten columns; the biclustering panel keeps only columns 1, 2, 3, 5, 7, 10 and shows rows Gene A, B, C, D, K, L.]
• Similarity does not exist over all attributes.
• Solution: cluster both rows and columns simultaneously (biclustering).
• Bicluster {1,2,3,5,7,10} {A,B,C,D,E,F}

Biclustering characteristics
Biclustering algorithms should identify groups of genes and conditions, obeying the following rules:
• A cluster of genes should be defined with respect to only a subset of the conditions.
• A cluster of conditions should be defined with respect to only a subset of the genes.
• The clusters should not be exclusive and/or exhaustive.
• There are no a priori constraints on the organization of biclusters: a gene or condition should be able to belong to more than one bicluster or to no bicluster at all.
• The lack of structural constraints on biclustering solutions allows greater freedom but consequently makes them more vulnerable to overfitting.
• Biclustering algorithms must therefore guarantee that the output biclusters are meaningful, by means of an accompanying statistical model or a heuristic scoring method that defines which of the many possible submatrices represent significant biological behavior.

Biclustering: clinical application
• In clinical applications, gene expression analysis is done on tissues taken from patients with a medical condition. Using such assays, biologists have identified molecular fingerprints that can help in the classification and diagnosis of the patient status and guide treatment protocols.
• The focus is to identify profiles of expression over a subset of the genes that can be associated with clinical conditions and treatment outcomes, where ideally the set of samples is equal in all but the subtype or the stage of the disease.
• However, a patient may be part of more than one clinical group, e.g., may suffer from syndrome A, have a genetic background B and be exposed to environment C.
• Biclustering analysis is thus highly appropriate for identifying and distinguishing the biological factors affecting the patients along with the corresponding gene subsets.

Biclustering: functional genomics application
• Goal: understand the functions of each of the genes operating in a biological system.
• The rationale is that genes with similar expression patterns are likely to be regulated by the same factors and therefore may share function.
• By collecting expression profiles from many different biological conditions and identifying joint patterns of gene expression among them, researchers have characterized transcriptional programs and assigned putative function to thousands of genes.
• Since genes have multiple functions, and since transcriptional programs are often based on combinatorial regulation, biclustering is highly appropriate for these applications as well.
• An important aspect of gene expression data is their high noise level: biclustering algorithms should be robust enough to cope with significant levels of noise.

Bicluster Types
An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.
We identify four major classes of biclusters:
1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Bicluster Types
• According to the specific properties of each problem:
– One or more of these different types of biclusters are generally considered interesting.
– A different type of merit function should be used to evaluate the quality of the biclusters identified.
• The choice of the merit function is strongly related to the characteristics of the biclusters each algorithm aims at finding.

Biclusters with constant values
• The simplest biclustering algorithms identify subsets of rows and subsets of columns with constant values.
• A perfect constant bicluster is a sub-matrix (I,J) where all values within the bicluster are equal for all i ∈ I and j ∈ J:
$a_{ij} = \mu$
a a a a
a a a a
a a a a
a a a a
• The merit function used to compute and evaluate constant biclusters is, in general, the variance or some metric based on it.

Biclusters with constant values on rows
• A perfect bicluster with constant rows is a sub-matrix (I,J) where all values within the bicluster can be obtained using one of the following expressions:
$a_{ij} = \mu + \alpha_i$
$a_{ij} = \mu \times \alpha_i$
Where:
• $\mu$ is the typical value within the bicluster
• $\alpha_i$ is the adjustment for row i ∈ I.
Additive example:
a a a a
a+i a+i a+i a+i
a+j a+j a+j a+j
a+k a+k a+k a+k
Multiplicative example:
a a a a
a×i a×i a×i a×i
a×j a×j a×j a×j
a×k a×k a×k a×k
• A bicluster with constant values in the rows identifies a subset of genes with similar expression values across a subset of conditions, allowing the expression levels to differ from gene to gene.

Biclusters with constant values on columns
• A perfect bicluster with constant columns is a sub-matrix (I,J) where all values within the bicluster can be obtained using one of the following expressions:
$a_{ij} = \mu + \beta_j$
$a_{ij} = \mu \times \beta_j$
Where:
• $\mu$ is the typical value within the bicluster
• $\beta_j$ is the adjustment for column j ∈ J.
Additive example:
a a+i a+j a+k
a a+i a+j a+k
a a+i a+j a+k
a a+i a+j a+k
Multiplicative example:
a a×i a×j a×k
a a×i a×j a×k
a a×i a×j a×k
a a×i a×j a×k
• A bicluster with constant values in the columns identifies a subset of conditions within which a subset of genes present similar expression values, assuming that the expression values may differ from condition to condition.

Biclusters with constant values on rows or columns
• The straightforward approach to identify non-constant biclusters is to normalize the rows or the columns of the data matrix using the row mean and the column mean, respectively.
• By doing this, the biclusters with constant rows/columns are transformed into constant biclusters before the biclustering algorithm is applied.

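The normalization idea can be sketched in a few lines of plain Python. This is only an illustration: the function name `subtract_row_means` is hypothetical, and the normalization is applied directly to a small constant-rows submatrix rather than to a full data matrix.

```python
def subtract_row_means(matrix):
    """Return a copy of `matrix` with each row's mean subtracted from its entries."""
    normalized = []
    for row in matrix:
        mean = sum(row) / len(row)
        normalized.append([value - mean for value in row])
    return normalized


# Bicluster with constant rows (a_ij = mu + alpha_i).
constant_rows = [
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0, 3.0],
]

# After row normalization every entry has the same value (zero),
# so the submatrix has become a constant bicluster.
normalized = subtract_row_means(constant_rows)
```

The same approach with column means would handle biclusters with constant columns.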
Biclusters with coherent values
• A perfect bicluster with coherent values is defined as a subset of rows and a subset of columns whose values are predicted using the following expressions:
– ADDITIVE MODEL:
$a_{ij} = \mu + \alpha_i + \beta_j$
a b c d
a+i b+i c+i d+i
a+j b+j c+j d+j
a+k b+k c+k d+k
Where:
• $\mu$ is the typical value within the bicluster
• $\alpha_i$ is the adjustment for row i ∈ I
• $\beta_j$ is the adjustment for column j ∈ J.

Biclusters with coherent values
– MULTIPLICATIVE MODEL:
$a_{ij} = \mu' \times \alpha'_i \times \beta'_j$
a b c d
a×i b×i c×i d×i
a×j b×j c×j d×j
a×k b×k c×k d×k
Where:
• $\mu'$ is the typical value within the bicluster
• $\alpha'_i$ is the adjustment for row i ∈ I
• $\beta'_j$ is the adjustment for column j ∈ J.

Types of Biclusters: examples
Constant values
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Constant values on rows
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0
Constant values on columns
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
Coherent values: additive model
1.0 2.0 5.0 0.0
2.0 3.0 6.0 1.0
4.0 5.0 8.0 3.0
5.0 6.0 9.0 4.0
Coherent values: multiplicative model
1.0 2.0 0.5 1.5
2.0 4.0 1.0 3.0
4.0 8.0 2.0 6.0
3.0 6.0 1.5 4.5

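The bicluster types can be checked mechanically on the example matrices. The sketch below is illustrative only: the function names and the tolerance `tol` are assumptions, and the multiplicative test assumes the top-left entry is non-zero. Note that the types are nested, so a constant bicluster also passes the constant-rows, constant-columns and additive checks.

```python
def is_constant(m, tol=1e-9):
    """All entries equal: a_ij = mu."""
    return all(abs(v - m[0][0]) <= tol for row in m for v in row)


def has_constant_rows(m, tol=1e-9):
    """Every row is constant: a_ij = mu + alpha_i."""
    return all(abs(v - row[0]) <= tol for row in m for v in row)


def has_constant_columns(m, tol=1e-9):
    """Every column is constant: a_ij = mu + beta_j."""
    return all(abs(row[j] - m[0][j]) <= tol for row in m for j in range(len(row)))


def is_additive(m, tol=1e-9):
    """a_ij = mu + alpha_i + beta_j  <=>  a_ij - a_i0 - a_0j + a_00 = 0."""
    return all(abs(m[i][j] - m[i][0] - m[0][j] + m[0][0]) <= tol
               for i in range(len(m)) for j in range(len(m[0])))


def is_multiplicative(m, tol=1e-9):
    """a_ij = mu' * alpha'_i * beta'_j  <=>  a_ij * a_00 = a_i0 * a_0j (a_00 != 0)."""
    return all(abs(m[i][j] * m[0][0] - m[i][0] * m[0][j]) <= tol
               for i in range(len(m)) for j in range(len(m[0])))


# The two coherent-value examples from the slide.
additive = [[1.0, 2.0, 5.0, 0.0], [2.0, 3.0, 6.0, 1.0],
            [4.0, 5.0, 8.0, 3.0], [5.0, 6.0, 9.0, 4.0]]
multiplicative = [[1.0, 2.0, 0.5, 1.5], [2.0, 4.0, 1.0, 3.0],
                  [4.0, 8.0, 2.0, 6.0], [3.0, 6.0, 1.5, 4.5]]
constant_rows = [[1.0] * 4, [2.0] * 4, [3.0] * 4, [4.0] * 4]
```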
General additive models
• For every element $a_{ij}$:
– The general additive model represents a sum of models.
– Each model represents the contribution of the bicluster $B_k$ to the value of $a_{ij}$ in case i ∈ I and j ∈ J, where I and J are the rows and columns of $B_k$.
• The general additive model is defined as follows:
\[ a_{ij} = \sum_{k=0}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk} \]
where:
– K is the number of biclusters
– The terms $\rho_{ik}$ and $\kappa_{jk}$ are binary values that represent memberships:
• $\rho_{ik}$ is the membership of row i in the bicluster k.
• $\kappa_{jk}$ is the membership of column j in the bicluster k.

General additive models
The value of $\theta_{ijk}$ specifies the contribution of each bicluster k and can be one of the following expressions:
• $\mu_k$
• $\mu_k + \alpha_{ik}$
• $\mu_k + \beta_{jk}$
• $\mu_k + \alpha_{ik} + \beta_{jk}$
Representing different types of biclusters:
• Constant biclusters
• Biclusters with constant rows/columns
• Biclusters with additive model

General additive models
[Figure: four examples of two overlapping 4×4 biclusters inside a 6×6 matrix. Entries in the overlap are the sum of the two biclusters' values; "." marks cells outside both biclusters.]
Constant values
1.0 1.0 1.0 1.0  .   .
1.0 1.0 1.0 1.0  .   .
1.0 1.0 3.0 3.0 2.0 2.0
1.0 1.0 3.0 3.0 2.0 2.0
 .   .  2.0 2.0 2.0 2.0
 .   .  2.0 2.0 2.0 2.0
Constant rows
1.0 1.0 1.0 1.0  .   .
2.0 2.0 2.0 2.0  .   .
3.0 3.0 8.0 8.0 5.0 5.0
4.0 4.0 10  10  6.0 6.0
 .   .  7.0 7.0 7.0 7.0
 .   .  8.0 8.0 8.0 8.0
Constant columns
1.0 2.0 3.0 4.0  .   .
1.0 2.0 3.0 4.0  .   .
1.0 2.0 8.0 10  7.0 8.0
1.0 2.0 8.0 10  7.0 8.0
 .   .  5.0 6.0 7.0 8.0
 .   .  5.0 6.0 7.0 8.0
Coherent values
1.0 2.0 5.0 0.0  .   .
2.0 3.0 6.0 1.0  .   .
4.0 5.0 9.0 5.0 5.0 0.0
5.0 6.0 11  7.0 6.0 1.0
 .   .  4.0 5.0 8.0 3.0
 .   .  5.0 6.0 9.0 4.0

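As a small illustration of the general additive model, the sketch below rebuilds the "constant rows" overlap example. The function name `general_additive` and the representation of each bicluster as a `(rows, columns, theta)` triple are assumptions made for this example; looping over a bicluster's own rows and columns plays the role of the binary memberships, and cells outside every bicluster are simply left at 0.0 here.

```python
def general_additive(n_rows, n_cols, biclusters):
    """Sum the contribution theta_ijk of every bicluster k that covers cell (i, j)."""
    a = [[0.0] * n_cols for _ in range(n_rows)]
    for rows, cols, theta in biclusters:
        # Visiting only (i, j) with i in rows and j in cols is equivalent to
        # multiplying theta_ijk by the memberships rho_ik * kappa_jk.
        for i in rows:
            for j in cols:
                a[i][j] += theta(i, j)
    return a


# Two overlapping biclusters with constant rows (theta_ijk = mu_k + alpha_ik).
row_value_1 = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
row_value_2 = {2: 5.0, 3: 6.0, 4: 7.0, 5: 8.0}
biclusters = [
    (range(0, 4), range(0, 4), lambda i, j: row_value_1[i]),
    (range(2, 6), range(2, 6), lambda i, j: row_value_2[i]),
]

a = general_additive(6, 6, biclusters)
# In the overlap (rows 2-3, columns 2-3) the two contributions add up:
# 3.0 + 5.0 = 8.0 and 4.0 + 6.0 = 10.0.
```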
General multiplicative models
• Similarly, we can also think of a general multiplicative model:
\[ a_{ij} = \prod_{k=0}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk} \]
where:
– K is the number of biclusters
– The terms $\rho_{ik}$ and $\kappa_{jk}$ are binary values that represent memberships:
• $\rho_{ik}$ is the membership of row i in the bicluster k.
• $\kappa_{jk}$ is the membership of column j in the bicluster k.

General multiplicative models
The value of $\theta_{ijk}$ specifies the contribution of each bicluster k and can be one of the following expressions:
• $\mu_k$
• $\mu_k \times \alpha_{ik}$
• $\mu_k \times \beta_{jk}$
• $\mu_k \times \alpha_{ik} \times \beta_{jk}$
Representing different types of biclusters:
• Constant biclusters
• Biclusters with constant rows/columns
• Biclusters with multiplicative model

General multiplicative models
[Figure: four examples of two overlapping 4×4 biclusters inside a 6×6 matrix. Entries in the overlap are the product of the two biclusters' values; "." marks cells outside both biclusters.]
Constant values
1.0 1.0 1.0 1.0  .   .
1.0 1.0 1.0 1.0  .   .
1.0 1.0 2.0 2.0 2.0 2.0
1.0 1.0 2.0 2.0 2.0 2.0
 .   .  2.0 2.0 2.0 2.0
 .   .  2.0 2.0 2.0 2.0
Constant rows
1.0 1.0 1.0 1.0  .   .
2.0 2.0 2.0 2.0  .   .
3.0 3.0 15  15  5.0 5.0
4.0 4.0 24  24  6.0 6.0
 .   .  7.0 7.0 7.0 7.0
 .   .  8.0 8.0 8.0 8.0
Constant columns
1.0 2.0 3.0 4.0  .   .
1.0 2.0 3.0 4.0  .   .
1.0 2.0 15  24  7.0 8.0
1.0 2.0 15  24  7.0 8.0
 .   .  5.0 6.0 7.0 8.0
 .   .  5.0 6.0 7.0 8.0
Coherent values
1.0 2.0 5.0 0.0  .   .
2.0 3.0 6.0 1.0  .   .
4.0 5.0 2.0 12  5.0 0.0
5.0 6.0 3.0 18  6.0 1.0
 .   .  4.0 5.0 8.0 3.0
 .   .  5.0 6.0 9.0 4.0
Overlap annotations on the slide: 1×2, 6×2, 2×1.5, 4.5×4 (giving 2.0, 12, 3.0 and 18).

BICLUSTERING ALGORITHMS
Algorithms
• Different Objectives
– Identify one bicluster.
– Identify a given number of biclusters.
• Different Approaches
– Discover one bicluster at a time.
– Discover one set of biclusters at a time.
– Discover all biclusters at the same time (simultaneous bicluster identification).

Algorithms:
• Iterative Row and Column Clustering Combination:
– Apply clustering algorithms to the rows and columns of the data matrix, separately.
– Combine the two cluster arrangements using some sort of iterative procedure.
• Divide and Conquer:
– Break the problem into several sub-problems that are similar to the original problem but smaller in size.
– Solve the sub-problems recursively.
– Combine the intermediate solutions to create a solution to the original problem.
– Usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices.

Algorithms:
• Greedy Iterative Search (example: the Cheng & Church algorithm):
– Make a locally optimal choice in the hope that this choice will lead to a globally good solution.
– Usually perform greedy row/column addition/removal.
• Exhaustive Bicluster Enumeration:
– The best biclusters are identified using an exhaustive enumeration of all possible biclusters existing in the data, in exponential time.

Overview of the Biclustering Algorithms
Method | Published | Cluster Model | Goal
Cheng & Church | ISMB 2000 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Getz et al. (CTWC) | PNAS 2000 | Depending on plugin clustering algorithm | Depending on plugin clustering algorithm
Lazzeroni & Owen (Plaid Models) | Bioinformatics 2000 | Background + row effect + column effect | Minimize modeling error
Ben-Dor et al. (OPSM) | RECOMB 2002 | All genes have the same order of expression values | Minimize the p-values of biclusters
Tanay et al. (SAMBA) | Bioinformatics 2002 | Maximum bounded bipartite subgraph | Minimize the p-values of biclusters
Yang et al. (FLOC) | BIBE 2003 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Kluger et al. (Spectral) | Genome Res. 2003 | Background × row effect × column effect | Finding checkerboard structures
Taken from Kevin Yip, 2003

Overview of the Biclustering Algorithms
Method | Allow overlap? | Bicluster Discovery | Complexity | Testing Data
Cheng & Church | Yes (rare in reality) | One at a time | O(MN) or O(MlogN) | Yeast (2884×17), lymphoma (4026×96)
Getz et al. (CTWC) | Yes | One set at a time | Exponential | Leukemia (1753×72), colon cancer (2000×62)
Lazzeroni & Owen (Plaid Models) | Yes | One at a time | Polynomial | Food (961×6), forex (276×18), yeast (2467×79)
Ben-Dor et al. (OPSM) | Yes | All at the same time | O(NM^3 l) | Breast tumor (3226×22)
Tanay et al. (SAMBA) | Yes | All at the same time | O((N2d+1)log(r+1)/r(rd)) | Lymphoma (4026×96), yeast (6200×515)
Yang et al. (FLOC) | Yes | All at the same time | O((N+M)^2 kp) | Yeast (2884×17)
Kluger et al. (Spectral) | No | All at the same time | Polynomial | Lymphoma (1 rel., 1 abs.), leukemia, breast cell line, CNS embryonal tumor

Cheng and Church's Algorithm
• Cheng and Church were the first to introduce biclustering to gene expression analysis.
• Their algorithmic framework represents the biclustering problem as an optimization problem, defining a score for each candidate bicluster and developing heuristics to solve the constrained optimization problem defined by this score function. The constraints force the uniformity of the matrix and the procedure gives preference to larger submatrices.
• Cheng and Church implicitly assume that (gene, condition) pairs in a "good" bicluster have a constant expression level, plus possibly additive row- and column-specific effects.
Reference: Y. Cheng and G. M. Church, "Biclustering of Expression Data", ISMB 2000.

Cheng and Church's Algorithm
• Model: a bicluster is represented by the submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix).
• Each entry $a_{ij}$ in the bicluster is the sum of:
1. The background level
2. The row (gene) effect
3. The column (condition) effect
• A dataset contains a number of biclusters, which are not necessarily disjoint.

Cheng and Church's Algorithm: residue
• In the matrix A, the residue score of element $a_{ij}$ is given by:
\[ R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \]
where:
• $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$ is the mean of row i
• $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$ is the mean of column j
• $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}$ is the mean of A
• Biological meaning: the genes have the same (amount of) response to the conditions.

Cheng and Church's Algorithm: mean square residue
• The mean square residue is the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance.
\[ H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(a_{ij})^2 \]
• A submatrix $A_{IJ}$ is called a δ-bicluster if H(I,J) ≤ δ for some δ ≥ 0.
GOAL: find biclusters with low mean squared residue, in particular large and maximal ones with scores below a certain threshold δ.

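The score H(I,J) can be computed directly from its definition. The following is a minimal sketch in plain Python; the function name `mean_squared_residue` and the convention of passing the bicluster as lists of row and column indices are assumptions made for illustration.

```python
def mean_squared_residue(a, rows, cols):
    """H(I, J) for the submatrix of `a` selected by index lists `rows` (I) and `cols` (J)."""
    # Row means a_iJ, column means a_Ij and overall mean a_IJ of the submatrix.
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    total_mean = sum(row_mean.values()) / len(rows)
    # Squared residues R(a_ij)^2 = (a_ij - a_iJ - a_Ij + a_IJ)^2.
    squared = [
        (a[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
        for i in rows
        for j in cols
    ]
    return sum(squared) / (len(rows) * len(cols))


# A perfectly additive bicluster has zero mean squared residue.
additive = [[1.0, 2.0, 5.0, 0.0], [2.0, 3.0, 6.0, 1.0],
            [4.0, 5.0, 8.0, 3.0], [5.0, 6.0, 9.0, 4.0]]
h_additive = mean_squared_residue(additive, [0, 1, 2, 3], [0, 1, 2, 3])

# A 2x2 matrix whose rows move in opposite directions does not.
h_crossed = mean_squared_residue([[1.0, 2.0], [2.0, 1.0]], [0, 1], [0, 1])
```

In the second example every residue is ±0.5, so the score is 0.25.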
Cheng & Church's algorithm
\[ H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(a_{ij})^2 \]
• A score of H(I,J) = 0 means that the gene expression levels fluctuate in unison; this includes the trivial case of a constant bicluster whose elements all have a single value.
• With a score of H(I,J) ≠ 0 it is always possible to remove a row or a column to lower the score, until the remaining bicluster becomes constant.
• The global H score gives an indication of how the data fit together within that matrix, i.e., whether they have some coherence or are random:
– A high H value signifies that the data are uncorrelated.
– A low H value means that there is a correlation in the matrix.

Minimum squared residue: example
[Figure: example matrices with their mean squared residue scores]
• If 5 was replaced with 3, the score would change to H(M2) = 2.06.
• A matrix with elements randomly and uniformly generated in the range [a,b] (a = 1, b = 12) has an expected score of $(b-a)^2/12$. In this case: $H(M_3) = (12-1)^2/12 = 10.08$.

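The quoted expected score is the variance of a uniform distribution on [a,b]. As a worked check, treating the entries as continuous and independent, and assuming a matrix with |I| rows and |J| columns:

```latex
\[
\operatorname{Var}\bigl(U[a,b]\bigr) = \frac{(b-a)^2}{12},
\qquad
\frac{(12-1)^2}{12} = \frac{121}{12} \approx 10.08
\]
\[
\mathbb{E}\bigl[H(I,J)\bigr]
  = \frac{(|I|-1)(|J|-1)}{|I|\,|J|}\cdot\frac{(b-a)^2}{12}
  \;\longrightarrow\; \frac{(b-a)^2}{12}
  \quad \text{as } |I|, |J| \to \infty
\]
```

The second line shows why the value is only reached approximately: subtracting the row and column means removes a small share of the variance, which becomes negligible for large matrices.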
Cheng & Church's algorithm
• Constraints:
– 1×M and N×1 matrices always give zero residue.
→ Find biclusters with maximum sizes, with residues not more than a threshold δ (largest δ-biclusters).
– Constant matrices always give zero residue.
→ Use average row variance to evaluate the "interestingness" of a bicluster. Biologically, it represents genes that have large changes in expression values over different conditions.

Cheng & Church's algorithm
• Objective function for heuristic methods (to minimize):
\[ H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(a_{ij})^2 \]
→ H(I,J) is a sum of components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.

Cheng and Church's Algorithm
• Greedy approach to rapidly converge to a maximal bicluster.
• In phase I, it removes rows/columns with a large contribution to the mean squared residue (MSR).
• In phase II, rows/columns that have a low contribution to the MSR are added, without exceeding δ.
• After a bicluster is identified, its values are randomized to prevent it from showing up again.

Cheng and Church's Algorithm
Given the threshold parameter δ, the algorithm runs in two phases:
FIRST PHASE:
• The algorithm removes rows and columns from the full matrix. At each step, where the current submatrix has row set I and column set J, the algorithm examines the set of possible moves.
• For rows it calculates:
\[ d(i) = \frac{1}{|J|} \sum_{j \in J} RS_{I,J}(i,j) \]
• For columns it calculates:
\[ e(j) = \frac{1}{|I|} \sum_{i \in I} RS_{I,J}(i,j) \]
where $RS_{I,J}(i,j)$ is the squared residue of $a_{ij}$ in the current submatrix (I,J).
• It then selects the highest-scoring row or column and removes it from the current submatrix, as long as H(I,J) > δ.
→ The idea is that rows/columns with a large contribution to the score can be removed with guaranteed improvement (decrease) in the total mean square residue score.
→ A possible variation of this heuristic removes at each step all rows/columns with a contribution to the residue score that is higher than some threshold.

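The first phase can be sketched as a single-node deletion loop. This is an illustrative reading of the description above, not the authors' implementation: the function names, the tie-breaking rule (rows before columns) and the guard that stops at a single row or column are assumptions.

```python
def msr_terms(a, rows, cols):
    """Return H(I, J) together with the row scores d(i) and column scores e(j)."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(row_mean.values()) / len(rows)
    # Squared residue of every cell in the current submatrix.
    sq = {(i, j): (a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
          for i in rows for j in cols}
    d = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    e = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    return h, d, e


def single_node_deletion(a, delta):
    """Phase I sketch: drop the worst row or column until H(I, J) <= delta."""
    rows = list(range(len(a)))
    cols = list(range(len(a[0])))
    h, d, e = msr_terms(a, rows, cols)
    # A single row or column always has zero residue, so stop there at the latest.
    while h > delta and len(rows) > 1 and len(cols) > 1:
        worst_row = max(d, key=d.get)
        worst_col = max(e, key=e.get)
        if d[worst_row] >= e[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
        h, d, e = msr_terms(a, rows, cols)
    return rows, cols, h


# Four additive rows plus one row that does not follow the pattern.
data = [[1.0, 2.0, 5.0, 0.0], [2.0, 3.0, 6.0, 1.0], [4.0, 5.0, 8.0, 3.0],
        [5.0, 6.0, 9.0, 4.0], [9.0, 0.0, 9.0, 0.0]]
rows, cols, h = single_node_deletion(data, delta=0.1)
```

On this input the last row has by far the largest score d(i), so it is removed first and the remaining 4×4 submatrix is perfectly additive.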
Cheng and Church's Algorithm
SECOND PHASE:
• Goal: increase the matrix size without crossing the threshold δ.
• Rows and columns are added using the same scoring scheme, but this time looking for the lowest squared residues d(i), e(j) at each move, and terminating when none of the possible moves increases the matrix size without crossing the threshold δ.
→ Upon convergence, the algorithm outputs a submatrix with low mean residue and locally maximal size.
→ To discover more than one bicluster, Cheng and Church suggested repeated application of the biclustering algorithm on modified matrices. The modification consists of randomizing the values in the cells of the previously discovered biclusters, which prevents the correlative signal in them from benefiting any other bicluster in the matrix. This has the obvious effect of precluding the identification of biclusters with significant overlaps.

Evolutionary bicluster
• Binary encoding for rows/columns
• Fitness:
– mean squared residue
– row variance
– large volume
– penalty (exponential)
• Typical genetic operators
Reference: H. Banka and S. Mitra, "Evolutionary Biclustering of Gene Expressions", ACM Ubiquity, 7 (42), 2006.

Genetic Algorithms: a brief introduction
• The idea of the genetic algorithm (GA) was first introduced by John Holland in the early 1970s.
• It is based on an adaptive global search heuristic inspired by natural evolution and genetics, with a survival-of-the-fittest strategy.
• It is a stochastic, population-based search strategy that works on the biological mechanisms of natural selection, crossover, and mutation.
• GAs are executed iteratively on a set of coded solutions, called the population, with three basic operators: selection, crossover, and mutation.
• To solve a problem, a GA starts with a set of encoded random solutions (i.e., chromosomes) and evolves better sets of solutions over generations (iterations) by applying the basic GA operators.
• Better solutions are determined from objective values (fitness functions) that determine the suitability of the solutions for reproduction. Hence better solutions are selected, whereas the bad ones are eliminated from the population at each generation.

Simple Genetic Algorithm
{
    initialize population;
    evaluate population;
    while termination criteria not satisfied
    {
        select parents for reproduction;
        perform recombination and mutation;
        evaluate population;
    }
}
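The pseudocode can be turned into a small runnable sketch. Everything specific here is an assumption chosen for illustration and not taken from the slides: binary tournament selection, one-point crossover, bit-flip mutation, a fixed number of generations as termination criterion, and a toy fitness (the number of ones in the bit string).

```python
import random


def simple_ga(fitness, n_bits, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Evolve bit strings of length n_bits (n_bits >= 2) and return the best one seen."""
    rng = random.Random(seed)
    # Initialize population with random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # termination criterion: fixed budget
        offspring = []
        while len(offspring) < pop_size:
            # Select parents for reproduction (binary tournament).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Recombination: one-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            offspring.append(child)
        population = offspring
        # Evaluate population, remembering the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best


best = simple_ga(fitness=sum, n_bits=16)
```

For biclustering, `fitness` would be replaced by a score computed on the bicluster that the bit string encodes.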
Evolutionary biclustering: Representation
• An encoded solution representing a bicluster:
– Each bicluster is represented by a fixed-size binary string called a chromosome or individual, with a bit string for genes followed by another bit string for conditions.
– The chromosome corresponds to a solution of this optimal bicluster generation problem.
– A bit is set to one if the corresponding gene and/or condition is present in the bicluster, and reset to zero otherwise.

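A short sketch of this encoding, with a hypothetical helper `decode` that splits a chromosome into the selected gene and condition indices (the name and the list-of-ints representation are assumptions):

```python
def decode(chromosome, n_genes):
    """Split a bit string into the selected gene (row) and condition (column) indices."""
    gene_bits = chromosome[:n_genes]
    condition_bits = chromosome[n_genes:]
    genes = [i for i, bit in enumerate(gene_bits) if bit == 1]
    conditions = [j for j, bit in enumerate(condition_bits) if bit == 1]
    return genes, conditions


# 5 gene bits followed by 3 condition bits.
chromosome = [1, 0, 1, 1, 0, 0, 1, 1]
genes, conditions = decode(chromosome, n_genes=5)
# genes == [0, 2, 3] and conditions == [1, 2]
```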
Evolutionary biclustering: fitness function
• Goal: generate a maximal set of genes and conditions while maintaining the "homogeneity" of the biclusters (multi-objective optimization).
• Maximize: [objective functions; the formulas are not preserved in this transcript]
• where:
– g and c are the number of ones in the genes and conditions within the bicluster,
– G(g, c) is its mean squared residue score,
– δ is the user-defined threshold for the maximum acceptable dissimilarity or mean squared residue score of the bicluster,
– G and C are the total number of genes and conditions of the original gene expression array.

Evolutionary biclustering: Local search
• Since the initial biclusters are generated randomly, it may happen that some irrelevant genes and/or conditions get included in spite of their expression values lying far apart in the feature space.
• An analogous situation may also arise during crossover and mutation in each generation.
• These genes and conditions, with dissimilar values, need to be eliminated deterministically.
• Furthermore, for good biclustering, some genes and/or conditions having similar expression values need to be incorporated as well.
• The algorithm starts with a given bicluster and an initial gene expression array (G,C).
• The irrelevant genes or conditions having mean squared residue above (or below) a certain threshold are then selectively eliminated (or added).

Evolutionary biclustering
• Domination: given M objective functions, a solution x(1) is said to dominate another solution x(2) if both of the following conditions hold:
– x(1) is no worse than x(2) in all the M objective functions;
– x(1) is strictly better than x(2) in at least one of the M objective functions.
• Crowding distance: assigns the highest value to the boundary solutions and, to every other solution i, the average distance of the two solutions [(i+1)th and (i−1)th] on either side of it along each of the objectives.
• Crowding selection: a solution i wins a tournament with another solution j if:
– solution i has better rank, i.e., $r_i < r_j$; or
– both solutions are in the same front, i.e., $r_i = r_j$, but solution i is less densely located in the search space, i.e., $d_i > d_j$.

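The three definitions can be sketched as follows. The crowding distance follows the usual NSGA-II formulation (infinite distance for boundary solutions, range-normalized gaps summed over objectives), which is an assumption about what the slides intend; all objectives are taken to be maximized, and the function names are illustrative.

```python
def dominates(x1, x2):
    """True if objective vector x1 dominates x2 (all objectives are maximized)."""
    no_worse = all(a >= b for a, b in zip(x1, x2))
    strictly_better = any(a > b for a, b in zip(x1, x2))
    return no_worse and strictly_better


def crowding_distance(front):
    """Crowding distance of each objective vector in one non-dominated front."""
    n = len(front)
    distance = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        span = front[order[-1]][m] - front[order[0]][m]
        # Boundary solutions get the highest possible value.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if span == 0:
            continue
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            distance[order[pos]] += gap / span
    return distance


def crowded_tournament(i, j, rank, distance):
    """Return the winner of a crowding tournament between solutions i and j."""
    if rank[i] != rank[j]:
        return i if rank[i] < rank[j] else j
    return i if distance[i] > distance[j] else j


front = [(1.0, 4.0), (2.0, 3.0), (4.0, 1.0)]
dist = crowding_distance(front)
winner = crowded_tournament(0, 1, rank=[0, 0, 0], distance=dist)
```

Here no solution in `front` dominates another, the two extreme solutions get infinite distance, and the middle one gets 1.0 per objective.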
Evolutionary biclustering: The algorithm
The main steps of the proposed algorithm, repeated over a specified number of generations, are:
1. Generate a random population of size P.
2. Delete or add multiple nodes (genes and conditions) from each individual of the population.
3. Calculate the multi-objective fitness functions f1 and f2.
4. Rank the population using the dominance criteria.
5. Calculate the crowding distance.
6. Perform selection using crowding tournament selection.
7. Perform crossover and mutation (as in a conventional GA) to generate an offspring population of size P.
8. Combine the parent and offspring populations.
9. Rank the mixed population using the dominance criteria and crowding distance, as above.
10. Replace the parent population by the best P members of the combined population.

Biclustering advantages
1. Automatically selects genes and conditions with more coherent measurements.
2. Groups items based on a similarity measure that depends on a context, which is best defined as a subset of the attributes. It discovers not only the grouping, but the context as well. To some extent, these two become inseparable and exchangeable, which is a major difference between biclustering and clustering rows after clustering columns.
3. Allows rows and columns to be included in multiple biclusters, and thus allows one gene or one condition to be identified with more than one functional category. This added flexibility correctly reflects the reality of gene functionality and of overlapping factors in tissue samples and experimental conditions.

Biclustering: observations
• The algorithms presented demonstrate some of the approaches developed for the identification of bicluster patterns in large matrices, and in gene expression matrices in particular.
• The different methods can be classified:
a) By their model and scoring schemes
b) By the type of algorithm used for detecting biclusters

Biclustering: models and score
• To ensure that the biclusters are statistically significant, each of the biclustering methods defines a scoring scheme to assess the quality of candidate biclusters, or a constraint that determines which submatrices represent significant bicluster behavior.
• Constraint-based methods: search for gene (property) sets that define "stable" subsets of properties. Algorithms: the iterative signature algorithm, the coupled two-way clustering method and the spectral algorithm of Kluger et al.
• Scoring-based methods: rely on a background model for the data. The basic model assumes that biclusters are essentially uniform submatrices and scores them according to their deviation from such uniform behavior. More elaborate models allow different distributions for each condition and gene, usually in a linear way. Algorithms: the Cheng-Church algorithm and the Plaid model.

Biclustering: algorithmic approaches
• The algorithmic approaches for detecting biclusters given the data are greatly affected by the type of score/constraint model in use:
– Several algorithms alternate between phases of gene set and condition set optimization (the iterative signature algorithm and the coupled two-way clustering algorithm).
– Others use standard linear algebra or optimization algorithms to solve key subproblems (the Plaid model and the Spectral algorithm).
– A heuristic hill-climbing algorithm is used in the Cheng-Church algorithm.

Research Opportunities
Many issues in biclustering algorithm design remain open and should be addressed by the scientific community:
– Propose other bicluster models.
– Based on the current models, propose new algorithms that improve bicluster quality (validated statistically or biologically) and/or time complexity.
– Combine the strengths of multiple studies.
– Investigate the effects of normalization on the models/algorithms.
– Compare the different methods on other real datasets.
– Make better use of domain knowledge.