Microarray Data Analysis. Class discovery and Class prediction: Clustering and Discrimination
Date posted: 21-Dec-2015

Gene expression profiles
- Many genes show definite changes of expression between conditions.
- These patterns are called gene profiles.

Motivation (1): The problem of finding patterns
- It is common to have hybridizations where conditions reflect temporal or spatial aspects:
  - Yeast cell-cycle data
  - Tumor data: evolution after chemotherapy
  - CNS data from different parts of the brain
- Interesting genes may be those showing patterns associated with changes.
- The problem is to distinguish interesting or real patterns from meaningless variation, at the level of the gene.

Finding patterns: Two approaches
- If patterns already exist: profile comparison (distance analysis)
  - Find the genes whose expression fits specific, predefined patterns.
  - Find the genes whose expression follows the pattern of a predefined gene or set of genes.
- If we wish to discover new patterns: cluster analysis (class discovery)
  - Carry out some kind of exploratory analysis to see what expression patterns emerge.
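
Profile comparison can be sketched in a few lines of R. The example below is only an illustration on made-up data: the matrix `expr`, the gene names, the template `pattern` and the cut-off 0.2 are all assumptions, and correlation distance is just one possible way to score the fit to a predefined pattern.

```r
# Hypothetical toy data: 5 genes measured at 5 time points.
pattern <- c(0, 1, 2, 1, 0)                         # predefined "up then down" profile
expr <- rbind(
  geneA = pattern * 2 + 3,                          # same shape, rescaled and shifted
  geneB = pattern + c(0.1, -0.1, 0.05, -0.05, 0.1), # same shape with small noise
  geneC = -pattern,                                 # mirror image of the pattern
  geneD = c(5, 1, 4, 2, 3),                         # unrelated profile
  geneF = c(1, 2, 3, 4, 5)                          # monotone increase
)

# Correlation of each gene profile (row) with the predefined pattern
r <- apply(expr, 1, cor, y = pattern)

# Correlation distance: 0 = same shape, 2 = mirror image
d <- 1 - r

# Genes whose profile fits the pattern (the threshold is an arbitrary choice)
matches <- names(d)[d < 0.2]
matches
```

Using a correlation-based score means that genes are matched on the shape of their profile and not on their absolute expression level.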

Motivation (2): Tumor classification
- A reliable and precise classification of tumours is essential for successful diagnosis and treatment of cancer.
- Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables.
- In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous.
- DNA microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumors.

Tumor classification, cont.
There are three main types of statistical problems associated with tumor classification:
1. The identification of new/unknown tumor classes using gene expression profiles: cluster analysis;
2. The classification of malignancies into known classes: discriminant analysis;
3. The identification of “marker” genes that characterize the different tumor classes: variable selection.

Cluster and Discriminant analysis
- These techniques group, or equivalently classify, observational units on the basis of measurements.
- They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping.
  - In cluster analysis (unsupervised learning, class discovery), there are no predefined groups or labels for the observations.
  - Discriminant analysis (supervised learning, class prediction) is based on the existence of groups (labels).

Clustering microarray data
- Clustering can be applied to genes (rows), mRNA samples (columns), or both at once.
  - Cluster samples to identify new cell or tumour subtypes.
  - Cluster rows (genes) to identify groups of co-regulated genes.
  - We can also cluster genes to reduce redundancy, e.g. for variable selection in predictive models.

Advantages of clustering
- Clustering leads to readily interpretable figures.
- Clustering strengthens the signal when averages are taken within clusters of genes (Eisen).
- Clustering can be helpful for identifying patterns in time or space.
- Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc.).

Applications of clustering
- Alizadeh et al. (2000), Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
- Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
- The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

Clusters on both genes and arrays
Taken from Nature, February 2000. Paper by Alizadeh, A. et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.

Discovering tumor subclasses
- DLBCL is clinically heterogeneous.
- Specimens were clustered based on their expression profiles of GC B-cell associated genes.
- Two subgroups were discovered:
  - GC B-like DLBCL
  - Activated B-like DLBCL

Basic principles of clustering
- Aim: to group observations that are “similar” based on predefined criteria.
- Issues:
  - Which genes / arrays to use?
  - Which similarity or dissimilarity measure?
  - Which clustering algorithm?
- It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context specific; see the later example.

Measures of dissimilarity
- Correlation
- Distance
  - Manhattan
  - Euclidean
  - Mahalanobis distance
  - Many more ...
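
The measures listed above are all available in base R. The following sketch computes them on a small made-up matrix; the object `x` and its sample names are hypothetical, and the Mahalanobis line shows only one common use (distance of each sample from the overall mean).

```r
# Hypothetical toy matrix: 6 samples (rows) x 3 genes (columns).
x <- rbind(s1 = c(1, 2, 3),
           s2 = c(2, 4, 6),   # s2 = 2 * s1: same shape, different scale
           s3 = c(3, 2, 1),   # mirror image of s1
           s4 = c(1, 3, 2),
           s5 = c(4, 1, 5),
           s6 = c(5, 3, 4))

d.euc <- dist(x, method = "euclidean")
d.man <- dist(x, method = "manhattan")

# Correlation distance between rows; cor() correlates columns, hence t(x)
d.cor <- as.dist(1 - cor(t(x)))

# Squared Mahalanobis distance of each sample from the overall mean
d.mah <- mahalanobis(x, center = colMeans(x), cov = cov(x))

round(as.matrix(d.cor)[1:3, 1:3], 2)
```

Samples s1 and s2 are far apart in Euclidean or Manhattan terms but have correlation distance 0, which illustrates why the choice of measure can change the clustering.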

Two basic types of methods
- Partitioning
- Hierarchical

Partitioning methods
- Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
- Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within-cluster sums of squares.
- Examples:
  - k-means, self-organizing maps (SOM), PAM, etc.
  - Fuzzy: needs a stochastic model, e.g. Gaussian mixtures.
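
As a minimal sketch of a partitioning method, the block below runs k-means on simulated data with `kmeans()` from base R. The data, the seed and the choice k = 2 are assumptions made for illustration only.

```r
set.seed(42)
# Hypothetical toy data: two well-separated groups of 10 points in 2-D.
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

# k must be chosen in advance; nstart = 10 repeats the iterative
# reallocation from 10 random starts and keeps the best solution
km <- kmeans(x, centers = 2, nstart = 10)

km$cluster        # cluster label for every observation
km$tot.withinss   # total within-cluster sum of squares (the criterion minimized)
```

Every observation receives a label, which is the "exhaustive" property mentioned above.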

Hierarchical methods
- Hierarchical clustering methods produce a tree or dendrogram.
- They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained from cutting the tree at some level.
- The tree can be built in two distinct ways:
  - bottom-up: agglomerative clustering;
  - top-down: divisive clustering.
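
The idea that one tree provides a partition for each k can be sketched with `hclust()` and `cutree()`. The simulated data and the range k = 2, 3, 4 below are assumptions chosen for illustration.

```r
set.seed(1)
# Hypothetical toy data: 15 points in 2-D forming three tight groups.
x <- rbind(matrix(rnorm(10, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(10, mean = 3, sd = 0.2), ncol = 2),
           matrix(rnorm(10, mean = 6, sd = 0.2), ncol = 2))

tree <- hclust(dist(x), method = "average")   # one tree ...

# ... cut at several levels: one column of cluster labels per k
parts <- cutree(tree, k = 2:4)
head(parts)

# Partitions from the same tree are nested: every cluster at k = 3
# lies inside a single cluster at k = 2
table(k3 = parts[, "3"], k2 = parts[, "2"])
```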

Agglomerative methods
- Start with n clusters (one per observation).
- At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters.
- Between-cluster dissimilarity measures:
  - Mean-link: average of pairwise dissimilarities.
  - Single-link: minimum of pairwise dissimilarities.
  - Complete-link: maximum of pairwise dissimilarities.
  - Distance between centroids.
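
To see how the between-cluster dissimilarity changes the tree, the sketch below applies several linkage rules to the same five points on a line. The points are made up; `method = "average"` is the name `hclust()` uses for mean-link, and the centroid variant is fed squared distances as recommended in the `hclust()` documentation.

```r
# Hypothetical toy data: five points on a line.
x <- matrix(c(0, 1, 3, 7, 15), ncol = 1)
d <- dist(x)

h.single   <- hclust(d, method = "single")    # minimum pairwise dissimilarity
h.complete <- hclust(d, method = "complete")  # maximum pairwise dissimilarity
h.average  <- hclust(d, method = "average")   # mean pairwise dissimilarity
h.centroid <- hclust(d^2, method = "centroid")  # centroids: squared distances

# Heights at which the successive merges happen
rbind(single   = h.single$height,    # 1, 2,   4,    8
      complete = h.complete$height,  # 1, 3,   7,    15
      average  = h.average$height)   # 1, 2.5, 5.67, 12.25
```

The merge order is the same here, but the merge heights differ, so cutting the trees at a fixed height can give different partitions.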

[Figure: four panels illustrating the between-cluster dissimilarity measures: distance between centroids, single-link, complete-link, mean-link.]

Divisive methods
- Start with only one cluster.
- At each step, split clusters into two parts.
- Split to give the greatest distance between the two new clusters.
- Advantages:
  - Obtain the main structure of the data, i.e. focus on the upper levels of the dendrogram.
- Disadvantages:
  - Computational difficulties when considering all possible divisions into two groups.
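
One available divisive algorithm is DIANA, implemented as `diana()` in the `cluster` package. It sidesteps the enumeration of all possible splits by using a heuristic, so it is a sketch of the top-down idea and not an exhaustive search. The simulated data below are an assumption.

```r
library(cluster)   # recommended package distributed with R

set.seed(7)
# Hypothetical toy data: two groups of 6 points in 2-D.
x <- rbind(matrix(rnorm(12, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(12, mean = 4, sd = 0.3), ncol = 2))

dv <- diana(x)                        # top-down (divisive) clustering

# The first split of the tree gives the main structure of the data
top <- cutree(as.hclust(dv), k = 2)
top
```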

Agglomerative
[Figure: illustration of five points (1 to 5) in two-dimensional space and the corresponding dendrogram with leaf order 1, 5, 2, 3, 4. Points 1 and 5 merge into {1,5} and points 3 and 4 into {3,4}; point 2 then joins to form {1,2,5}; finally all points form {1,2,3,4,5}.]

Agglomerative: tree re-ordering?
[Figure: the same five points and dendrogram (merges {1,5}, {3,4}, {1,2,5}, {1,2,3,4,5}), shown with its leaves in the order 1, 5, 2, 3, 4, raising the question of how the leaves should be ordered.]

Partitioning or Hierarchical?

Partitioning:
- Advantages
  - Optimal for certain criteria.
  - Genes are automatically assigned to clusters.
- Disadvantages
  - Need an initial k.
  - Often require long computation times.
  - All genes are forced into a cluster.

Hierarchical:
- Advantages
  - Faster computation.
  - Visual.
- Disadvantages
  - Unrelated genes are eventually joined.
  - Rigid: cannot correct later for erroneous decisions made earlier.
  - Hard to define clusters.

Hybrid Methods
- Mix elements of partitioning and hierarchical methods:
  - Bagging: Dudoit & Fridlyand (2002)
  - HOPACH: van der Laan & Pollard (2001)

Three generic clustering problems
Three important tasks (which are generic) are:
1. Estimating the number of clusters;
2. Assigning each observation to a cluster;
3. Assessing the strength/confidence of cluster assignments for individual observations.

Not equally important in every problem.

Estimating number of clusters using silhouette
- Define the silhouette width of an observation as S = (b - a) / max(a, b), where a is the average dissimilarity to all the other points in its own cluster and b is the smallest average dissimilarity to the points of any other cluster.
- Intuitively, objects with large S are well clustered, while those with small S tend to lie between clusters.
- How many clusters: perform clustering for a sequence of values of the number of clusters k and choose the k corresponding to the largest average silhouette.
- The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
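
The "largest average silhouette" rule can be sketched as follows, using `pam()` from the `cluster` package, which stores the average silhouette width in `$silinfo$avg.width`. The simulated data, the seed and the candidate range k = 2 to 6 are assumptions for illustration.

```r
library(cluster)

set.seed(3)
# Hypothetical toy data: three well-separated groups of 10 points in 2-D.
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 8, sd = 0.3), ncol = 2))
d <- dist(x)

# Cluster for a sequence of k and record the average silhouette width
ks <- 2:6
avg.sil <- sapply(ks, function(k) pam(d, k, diss = TRUE)$silinfo$avg.width)

# Choose the k with the largest average silhouette
best.k <- ks[which.max(avg.sil)]
data.frame(k = ks, avg.sil = round(avg.sil, 3))
best.k
```

On clean simulated data like this the rule recovers the three groups; on real expression data the maximum is often much less clear-cut.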

Estimating number of clusters using the bootstrap
- There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for reviews see Milligan and Cooper (1985) and Dudoit and Fridlyand (2002)).
- The bottom line is that none works very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
- It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for clustering.

Limitations
Cluster analyses:
- are usually outside the normal framework of statistical inference;
- are less appropriate when only a few genes are likely to change;
- need lots of experiments;
- can always produce clusters, even if there is nothing going on;
- are useful for learning about the data, but do not provide biological truth.
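
The point that it is always possible to cluster can be checked on pure noise. In this sketch the matrix `noise`, its dimensions and the choice of three groups are arbitrary assumptions; the average silhouette width is reported only as one rough indication of how weak the resulting "clusters" are.

```r
library(cluster)

set.seed(10)
# Hypothetical data with no structure: 50 "genes" x 20 "samples" of noise.
noise <- matrix(rnorm(50 * 20), nrow = 50)

d <- as.dist(1 - cor(noise))          # correlation distance between samples
tree <- hclust(d, method = "complete")

groups <- cutree(tree, k = 3)         # three groups are returned regardless
table(groups)

# Average silhouette width of the forced partition
avg.sil <- mean(silhouette(groups, d)[, "sil_width"])
avg.sil
```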

Examples

The data
- We will use the dataset presented in van't Veer et al. (2002), which is available at http://www.rii.com/publications/2002/vantveer.htm.
- These data come from a study of gene expression in 97 breast cancer node-negative tumors.
- The training samples consisted of 78 tumors, 44 of which did not recur within 5 years of diagnosis and 34 did.
- Among the 19 test samples, 7 did not recur within 5 years and 12 did.
- The data were collected on two-color Agilent oligo arrays containing about 24K genes.

Preprocessing
- The data have been filtered using the procedures described in the original manuscript.
  - Only genes showing 2-fold differential expression and a p-value for the gene being expressed < 0.01 in more than 5 samples are retained.
  - There are 4,348 such genes.
- Missing values were imputed using the k-nearest neighbors imputation procedure (Troyanskaya et al., 2001).
  - For each gene containing at least one missing value, we find the 5 genes most highly correlated with it and take the average of their values in the sample where the value for the given gene is missing.
  - The missing value is replaced with this average.
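
The imputation step described above can be sketched as a small function. This is only an illustration of the description on this slide: `impute.by.neighbours` is a hypothetical helper, not the code used for these data and not the published KNNimpute implementation, and the toy matrix `m` is simulated.

```r
# Hypothetical helper: for each gene with missing values, average the
# k most highly correlated genes in the sample where the value is missing.
impute.by.neighbours <- function(m, k = 5) {
  out <- m
  r <- cor(t(m), use = "pairwise.complete.obs")  # gene-gene correlations
  diag(r) <- NA                                  # a gene is not its own neighbour
  for (g in which(rowSums(is.na(m)) > 0)) {
    nb <- order(r[g, ], decreasing = TRUE)[seq_len(k)]  # k most correlated genes
    for (s in which(is.na(m[g, ]))) {
      out[g, s] <- mean(m[nb, s], na.rm = TRUE)
    }
  }
  out
}

# Toy data (assumed): 6 co-expressed genes plus 4 unrelated genes, 8 samples
set.seed(5)
base <- rnorm(8)
m <- rbind(t(sapply(1:6, function(i) base + rnorm(8, sd = 0.05))),
           matrix(rnorm(4 * 8), nrow = 4))
true.value <- m[1, 3]
m[1, 3] <- NA                                    # knock out one value

filled <- impute.by.neighbours(m, k = 5)
c(true = true.value, imputed = filled[1, 3])
```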

R data
- The filtered gene expression levels are stored in a 4348 × 97 matrix named bcdata, with rows corresponding to genes and columns to mRNA samples.
- Additionally, the labels are contained in the 97-element vector surv.resp, with 0 indicating good outcome (no recurrence within 5 years after diagnosis) and 1 indicating bad outcome (recurrence within 5 years after diagnosis).

Hierarchical clustering (1)
Start by performing a hierarchical clustering of the mRNA samples, using correlation as the similarity function and complete-linkage agglomeration.

library(stats)
x1 <- as.dist(1 - cor(bcdata))   # correlation distance between samples (columns)
clust.cor <- hclust(x1, method = "complete")
plot(clust.cor, cex = 0.6)

Hierarchical clustering (1)
[Figure: dendrogram of the samples produced by the code above.]

Hierarchical clustering (2)
Perform a hierarchical clustering of the mRNA samples using Euclidean distance and average-linkage agglomeration. Results can differ significantly.

x2 <- dist(t(bcdata))   # Euclidean distance between samples (columns)
clust.euclid <- hclust(x2, method = "average")
plot(clust.euclid, cex = 0.6)

Hierarchical clustering (2)
[Figure: Cluster Dendrogram of the samples from hclust (*, "average") applied to dist(t(bcdata)); leaves are labeled sample:outcome (e.g. 54:bad, 37:good); vertical axis: Height, with ticks from 10 to 50.]

Comparison between orderings

| Sample | CORR.GROUP | EUC.GROUP |
|---|---|---|
| 1:good | 1 | 1 |
| 2:good | 1 | 1 |
| 6:good | 1 | 1 |
| 7:good | 1 | 1 |
| 14:good | 1 | 1 |
| 19:good | 1 | 1 |
| 21:good | 1 | 1 |
| 23:good | 1 | 1 |
| 30:good | 1 | 1 |
| 39:good | 1 | 1 |
| 41:good | 1 | 1 |
| 42:good | 1 | 1 |
| 49:bad | 1 | 1 |
| 58:bad | 1 | 1 |
| 61:bad | 1 | 1 |
| 70:bad | 1 | 1 |
| 72:bad | 1 | 1 |
| 88:good | 1 | 1 |
| 92:bad | 1 | 1 |
| 3:good | 2 | 1 |
| 9:good | 2 | 1 |
| 13:good | 2 | 1 |
| 47:bad | 2 | 1 |
| 83:good | 2 | 1 |
| 4:good | 3 | 1 |
| 11:good | 3 | 1 |
| 16:good | 3 | 1 |
| 40:good | 3 | 1 |
| 52:bad | 3 | 1 |
| 63:bad | 3 | 1 |

In this case we observe that:
- Clustering based on correlation and complete linkage distributes samples more uniformly between groups.
- The Euclidean/average-linkage combination yields one huge group and many small ones.
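
A comparison of this kind can be produced by cutting both trees and placing the group labels side by side. The sketch below uses a simulated stand-in for bcdata (the matrix `toy` is hypothetical), and the number of groups k = 3 is an assumption, since the value used for the table above is not stated.

```r
set.seed(2)
# Hypothetical stand-in for bcdata: 200 genes x 12 samples.
toy <- matrix(rnorm(200 * 12), nrow = 200,
              dimnames = list(NULL, paste0("s", 1:12)))

clust.cor    <- hclust(as.dist(1 - cor(toy)), method = "complete")
clust.euclid <- hclust(dist(t(toy)), method = "average")

# Group membership of every sample under each clustering (k = 3 assumed)
groups <- data.frame(CORR.GROUP = cutree(clust.cor, k = 3),
                     EUC.GROUP  = cutree(clust.euclid, k = 3))
groups

# Cross-tabulation: how the two partitions relate to each other
tab <- table(CORR = groups$CORR.GROUP, EUC = groups$EUC.GROUP)
tab
```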

Partitioning around medoids
- If we assume a fixed number of clusters, we can use a partitioning method such as PAM.
- It is a robust version of k-means which clusters the data around medoids.

library(cluster)
x3 <- as.dist(1 - cor(bcdata))
clust.pam <- pam(x3, 5, diss = TRUE)   # k = 5 clusters
clusplot(clust.pam, labels = 5,
         col.p = clust.pam$clustering)

PAM clustering (2)
[Figure: clusplot(pam(x = x3, k = 5, diss = TRUE)); horizontal axis: Component 1, vertical axis: Component 2; the five clusters are labeled 1 to 5.]
These two components explain 24.97 % of the point variability.