MIT OpenCourseWare · 2019-09-12 · Structure in High-Dimensional Data. Gyulassy, Atilla, et al....
MIT OpenCourseWare http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution, Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Clustering
Lecture 3 September 16, 2008
Computational Biology: Genomes, Networks, Evolution
Structure in High-Dimensional Data
Gyulassy, Atilla, et al. "Topologically Clean Distance Fields." IEEE Transactions on Visualization and Computer Graphics 13, no. 6 (2007): 1432-1439.
• Structure can be used to reduce dimensionality of data
• Structure can tell us something useful about the underlying phenomena
• Structure can be used to make inferences about new data
©2007 IEEE. Used with permission.
Clustering vs Classification
• Objects are characterized by one or more features
• Classification
  – Have labels for some points
  – Want a "rule" that will accurately assign labels to new points
  – Supervised learning
• Clustering
  – No labels
  – Group points into clusters based on how "near" they are to one another
  – Identify structure in data
  – Unsupervised learning
(Scatter plot: Expression in Exp 1 vs. Expression in Exp 2)
Today
• Microarray Data
• K-means clustering
• Expectation Maximization
• Hierarchical Clustering
Central Dogma
DNA → mRNA → protein → phenotype
We can measure amounts of mRNA for every gene in a cell
Expression Microarrays
• A way to measure the levels of mRNA for every gene
• Two basic types
  – Affymetrix gene chips
  – Spotted oligonucleotides
• Both work on the same principle
  – Put DNA probes on a slide
  – Complementary hybridization
Expression Microarrays
• Measure the level of mRNA messages in a cell
(Diagram: mRNA messages (RNA 1–6) are reverse-transcribed (RT) into cDNA, which hybridizes to the complementary DNA probes (Genes 1–6) on the slide; the bound signal is then measured.)
Expression Microarray Data Matrix
• Genes are typically given as rows
• Experiments are given as columns
(Matrix: m genes × n experiments)
Clustering and Classification in Genomics
• Classification
  – Microarray data: classify cell state (e.g. AML vs ALL) using expression data
  – Protein/gene sequences: predict function, localization, etc.
• Clustering
  – Microarray data: groups of genes that share similar function have similar expression patterns; identify regulons
  – Protein sequences: group related proteins to infer function
  – EST data: collapse redundant sequences
Clustering Expression Data
• Cluster experiments
  – Group by similar expression profiles
• Cluster genes
  – Group by similar expression in different conditions
(Figures: genes-by-experiments matrix; scatter of Experiment 1 vs. Experiment 2; profiles of Gene 1 and Gene 2 across experiments)
Why Cluster Genes by Expression?
• Data exploration
  – Summarize data
  – Explore without getting lost in each data point
  – Enhance visualization
• Co-regulated genes
  – Common expression may imply common regulation
  – Predict cis-regulatory promoter sequences
• Functional annotation
  – Infer similar function from similar expression
(Figure: GCN4 regulates His2, His3, and an unknown gene, all annotated "Amino Acids")
Clustering Algorithms
• Partitioning
  – Divides objects into non-overlapping clusters such that each data object is in exactly one subset
• Agglomerative
  – A set of nested clusters organized as a hierarchy
K-Means Clustering
The Basic Idea
• Assume a fixed number of clusters, K
• Goal: create “compact” clusters
More Formally
1. Initialize K centers \mu_k
For each iteration n until convergence:
2. Assign each x_i the label of the nearest center, where the distance between x_i and \mu_k is
   d(x_i, \mu_k) = \lVert x_i - \mu_k \rVert^2
3. Move each \mu_k to the centroid of the points with that label:
   \mu_k(n+1) = \frac{\sum_{x_i \text{ with label } k} x_i}{\#\{x_i \text{ with label } k\}}
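The two-step loop above can be sketched in plain Python. This is an illustrative sketch, not code from the lecture; the toy points and the choice K = 2 are made up:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 2-D points: assign to nearest center, then recenter."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: initialize K centers
    labels = []
    for _ in range(iters):
        # step 2: label each point with its nearest center (squared distance)
        labels = [min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2 +
                                              (p[1] - centers[j][1]) ** 2)
                  for p in points]
        # step 3: move each center to the centroid of the points with its label
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new.append((sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members)))
            else:
                new.append(centers[j])       # leave empty clusters in place
        if new == centers:                   # converged: assignments stable
            break
        centers = new
    return centers, labels

# two well-separated blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

On this toy data the loop recovers the two blobs regardless of which two points are sampled as initial centers.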
Cost Criterion
We can think of K-means as trying to create clusters that minimize a cost criterion associated with the size of each cluster:

\mathrm{COST}(x_1, x_2, \ldots, x_n) = \sum_k \sum_{x_i \text{ with label } k} \lVert x_i - \mu_k \rVert^2

(Figure: three clusters with centers \mu_1, \mu_2, \mu_3)

Minimizing this means minimizing each cluster's term separately. Expanding one term,

\sum_{x_i \text{ with label } k} \lVert x_i - \mu_k \rVert^2 = \sum_{x_i \text{ with label } k} \lVert x_i \rVert^2 - 2\,\mu_k \cdot \sum_{x_i \text{ with label } k} x_i + n_k \lVert \mu_k \rVert^2

where n_k is the number of points with label k; setting the derivative with respect to \mu_k to zero gives the optimum \mu_k = \frac{1}{n_k} \sum_{x_i \text{ with label } k} x_i, the centroid.
Fuzzy K-Means
• Initialize K centers \mu_k
• For each point, calculate the probability of membership in each category:
  P(\text{label } k \mid x_i, \mu)
• Move each \mu_k to the weighted centroid:
  \mu_k(n+1) = \frac{\sum_i P(k \mid x_i)^b \, x_i}{\sum_i P(k \mid x_i)^b}
• Iterate

Of course, K-means is just the special case where
P(\text{label } k \mid x_i, \mu) = 1 if \mu_k is the closest center to x_i, and 0 otherwise.
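One sweep of the soft-membership update can be sketched as follows. This is illustrative only: it assumes the unit-variance Gaussian memberships of the mixture model introduced on the following slides, takes b = 1, and uses made-up 1-D data:

```python
import math

def fuzzy_kmeans_step(points, centers, b=1.0):
    """One sweep of fuzzy K-means on 1-D data:
    soft memberships, then the weighted-centroid update."""
    # membership of each point in each cluster (normalized Gaussian weights)
    members = []
    for x in points:
        w = [math.exp(-0.5 * (x - c) ** 2) for c in centers]
        s = sum(w)
        members.append([v / s for v in w])
    # weighted centroid: mu_k = sum_i P(k|x_i)^b x_i / sum_i P(k|x_i)^b
    new_centers = []
    for k in range(len(centers)):
        num = sum(m[k] ** b * x for x, m in zip(points, members))
        den = sum(m[k] ** b for m in members)
        new_centers.append(num / den)
    return new_centers, members

points = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
centers = [2.0, 8.0]
for _ in range(20):
    centers, members = fuzzy_kmeans_step(points, centers)
```

With well-separated data the memberships become nearly hard and the centers settle near the blob means, which is the K-means special case described above.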
K-Means as a Generative Model
Samples are drawn from two equally likely normal distributions with unit variance: a Gaussian mixture model.

P(x_i \mid \mu_j) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\lVert x_i - \mu_j \rVert^2}{2}\right)

(Figure: points x_i drawn from Gaussians centered at \mu_1 and \mu_2; a model of P(X, \text{Labels}))
Unsupervised Learning
Can we learn the model from unlabeled samples?
Samples are drawn from two equally likely normal distributions with unit variance: a Gaussian mixture model.

P(x_i \mid \mu_j) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\lVert x_i - \mu_j \rVert^2}{2}\right)

(Figure: unlabeled points x_i; centers \mu_1, \mu_2 to be learned)
If We Have Labeled Points
Need to estimate the unknown Gaussian centers from the data.
In general, how could we do this? How could we "estimate" the "best" \mu_k?
Choose \mu_k to maximize the probability of the model.
(Figure: labeled points x_i with unknown centers \mu_1?, \mu_2?)
If We Have Labeled Points
Need to estimate the unknown Gaussian centers from the data. Given a set of x_i, all with label k, we can find the maximum-likelihood \mu_k from

\arg\max_{\mu} \log \prod_i P(x_i \mid \mu) = \arg\max_{\mu} \sum_i \left( -\frac{1}{2}\lVert x_i - \mu \rVert^2 + \log\frac{1}{\sqrt{2\pi}} \right) = \arg\min_{\mu} \sum_i \lVert x_i - \mu \rVert^2

The solution is the centroid of the x_i.
If We Know Cluster Centers
Need to estimate labels for the data. The most likely label for a point x_i is

\arg\max_k P(x_i \mid \mu_k) = \arg\max_k \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\lVert x_i - \mu_k \rVert^2}{2}\right) = \arg\min_k \lVert x_i - \mu_k \rVert^2

which is exactly the distance measure used by K-means.
What If We Have Neither?
An idea:
1. Imagine we start with some initial centers \mu_k^0
2. We could calculate the most likely labels for the x_i given these \mu_k^0:
   \text{labels}^0 = \arg\max_{\text{labels}} \prod_i P(x_i \mid \mu_{k_i}^0)
3. We could then use these labels to choose new centers \mu_k^1:
   \mu_k^1 = \arg\max_{\mu_k} \sum_{x_i \text{ with label}^0 k} \log P(x_i \mid \mu_k)
4. And iterate (to convergence)
Expectation Maximization (EM)
1. Initialize parameters
2. E step: estimate the probability of the hidden labels, Q, given the current parameters and the data:
   Q = P(\text{labels} \mid x, u^{t-1})
3. M step: choose new parameters to maximize the expected likelihood given Q:
   u^t = \arg\max_u \, E_Q\!\left[\log P(\text{labels}, x \mid u)\right]
4. Iterate
P(x \mid \text{Model}) is guaranteed not to decrease from one iteration to the next.
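That monotonicity guarantee can be checked numerically. Below is a minimal EM loop for the equal-weight, unit-variance Gaussian mixture of the earlier slides (an illustrative sketch with made-up 1-D data), tracking log P(x | Model) each iteration:

```python
import math

def loglik(points, centers):
    """log P(x | model) for an equal-weight, unit-variance Gaussian mixture."""
    ll = 0.0
    for x in points:
        px = sum(math.exp(-0.5 * (x - c) ** 2) / math.sqrt(2 * math.pi)
                 for c in centers) / len(centers)
        ll += math.log(px)
    return ll

def em_step(points, centers):
    """E step: responsibilities Q; M step: responsibility-weighted means."""
    q = []
    for x in points:
        w = [math.exp(-0.5 * (x - c) ** 2) for c in centers]
        s = sum(w)
        q.append([v / s for v in w])
    return [sum(qi[k] * x for x, qi in zip(points, q)) /
            sum(qi[k] for qi in q) for k in range(len(centers))]

points = [0.0, 0.4, 1.1, 8.9, 9.4, 10.2]
centers = [3.0, 6.0]
lls = []
for _ in range(15):
    lls.append(loglik(points, centers))
    centers = em_step(points, centers)
# lls is non-decreasing: each EM step improves (or preserves) the likelihood
```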
Expectation Maximization (EM)
Remember the basic idea!
1. Use the model to estimate the (distribution of) missing data
2. Use the estimate to update the model
3. Repeat until convergence

Here the model is the Gaussian distributions, and the missing data are the data-point labels.
Revisiting K-Means
(Generative-model perspective)
1. Initialize K centers \mu_k
2. Assign each x_i the label of the nearest center, where d(x_i, \mu_k) = \lVert x_i - \mu_k \rVert^2
   ← this picks the most likely label k for each point x_i
3. Move each \mu_k to the centroid of the points with that label
   ← this is the maximum-likelihood parameter \mu_k given the most likely labels
4. Iterate
Revisiting K-Means: Generative-Model Perspective
The same loop, written as hard-assignment EM:
1. Initialize parameters
2. E step: estimate the most likely missing label for each point given the previous parameters
3. M step: choose new parameters to maximize the likelihood given the estimated labels
4. Iterate
This is analogous to Viterbi learning for HMMs, where Viterbi is used to find the most likely missing path labels (see the Durbin book).
Revisiting Fuzzy K-Means
Recall that instead of assigning each point x_i to a single label k, we calculate the probability of each label for that point (fuzzy membership), P(\text{label } k \mid x_i, \mu), and that we then select a new \mu_k with the update:

\mu_k(n+1) = \frac{\sum_i P(k \mid x_i)^b \, x_i}{\sum_i P(k \mid x_i)^b}

Looking at the case b = 1: it can be shown that this update rule follows from assuming the Gaussian-mixture generative model and performing Expectation Maximization.
Revisiting Fuzzy K-Means: Generative-Model Perspective
1. Initialize K centers \mu_k
2. E step: for each point, calculate the probability of membership in each category, P(\text{label } k \mid x_i, \mu)
3. M step: move each \mu_k to the weighted centroid:
   \mu_k(n+1) = \frac{\sum_i P(k \mid x_i)^b \, x_i}{\sum_i P(k \mid x_i)^b}
4. Iterate
This is analogous to Baum-Welch for HMMs.
K-Means, Viterbi Learning & EM
K-means and fuzzy K-means are two related methods that can be seen as performing unsupervised learning on a Gaussian mixture model.
This perspective reveals the assumptions about the underlying data model.
We can relax those assumptions by relaxing constraints on the model:
• Including an explicit covariance matrix
• Relaxing the assumption that all Gaussians are equally likely
Implications: Non-globular Clusters
Actual Clustering K-means (K = 2)
But How Many Clusters?
• How do we select K?
  – We can always make clusters "more compact" by increasing K
  – e.g. what happens if K = the number of data points?
  – What is a meaningful improvement?
• Hierarchical clustering side-steps this issue
Hierarchical Clustering
Most widely used algorithm for expression data
• Start with each point in a separate cluster
• At each step:
  – Choose the pair of closest clusters
  – Merge them
(Dendrogram over points a–h; compare UPGMA in phylogeny)
Visualization of results
Hierarchical Clustering
slide credits: M. Kellis
• Avoids needing to select the number of clusters
• Produces clusters at all levels
• We can always select a "cut level" to create disjoint clusters
• But how do we define distances between clusters?
(Dendrogram over points a–h)
Distance Between Clusters
• Single-link method: CD(X, Y) = min_{x ∈ X, y ∈ Y} D(x, y)
• Complete-link method: CD(X, Y) = max_{x ∈ X, y ∈ Y} D(x, y)
• Average-link method: CD(X, Y) = avg_{x ∈ X, y ∈ Y} D(x, y)
• Centroid method: CD(X, Y) = D(avg(X), avg(Y))
(Dis)Similarity Measures
D’haeseleer (2005) Nat Biotech
Image removed due to copyright restrictions. Table 1, Gene expression similarity measures. D'haeseleer, Patrik. "How Does Gene Expression Clustering Work?" Nature Biotechnology 23 (2005): 1499-1501.
Evaluating Cluster Performance
• Robustness
  – Select random samples from the data set and cluster
  – Repeat
  – Robust clusters show up across the repeated runs
• Category enrichment
  – Look for categories of genes "over-represented" in particular clusters
  – Also used in motif discovery
In general, the right evaluation depends on your goals in clustering.
Evaluating Clusters: the Hypergeometric Distribution
• N experiments, p labeled ++, (N − p) labeled −−
• A cluster has k elements, m of them labeled ++
• P-value of a single cluster of k elements containing at least r that are ++:

P(\text{pos} \ge r) = \sum_{m \ge r} \frac{\binom{p}{m}\binom{N-p}{k-m}}{\binom{N}{k}}

Each summand is the probability that a randomly chosen set of k experiments contains exactly m positives and k − m negatives; the tail sum is the p-value for the uniformity of the computed cluster.
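The tail sum can be computed directly with Python's math.comb. The counts below are a made-up example, not data from the lecture:

```python
from math import comb

def cluster_pvalue(N, p, k, r):
    """P(at least r of the k cluster members are ++) under the
    hypergeometric null: choose k of N experiments at random,
    p of which are labeled ++."""
    return sum(comb(p, m) * comb(N - p, k - m)
               for m in range(r, min(p, k) + 1)) / comb(N, k)

# e.g. 20 experiments, 5 labeled ++; a cluster of 4 elements that
# contains at least 3 of the ++ experiments
pv = cluster_pvalue(N=20, p=5, k=4, r=3)
```

With r = 0 the sum covers the whole distribution and the p-value is 1, a quick sanity check on the formula.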
Similar Genes Can Cluster
(Eisen (1998) PNAS)
(A) Cholesterol biosynthesis
(B) Cell cycle
(C) Immediate early response
(D) Signalling and angiogenesis
(E) Wound healing
Clustered 8,600 human genes using an expression time course in fibroblasts.
Eisen, Michael et al. "Cluster Analysis and Display of Genome-wide Expression Patterns." PNAS 95, no. 25 (1998): 14863-14868.Copyright (1998) National Academy of Sciences, U.S.A.
Clusters and Motif Discovery
Expression from 15 time points during the yeast cell cycle.
Tavazoie & Church (1999)
(Figure: mean expression profiles, in S.D. from mean, across cell-cycle phases G1/S/G2/M for three clusters: Ribosome (1); RNA metabolism & translation (3); Methionine & sulphur metabolism)
Figure by MIT OpenCourseWare.
Next Lecture
The other side of the coin… Classification