Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to...
-
date post
21-Dec-2015 -
Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.
Goal B: Divide conditions to groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
Unsupervised Analysis
Clustering Methods
K-means: The Algorithm
Given a set of numeric points in d dimensional space, and integer k
Algorithm generates k (or fewer) clusters as follows:
1. Assign all points to a cluster at random
2. Compute centroid for each cluster
3. Reassign each point to nearest centroid
4. If centroids changed go back to stage 2
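The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the optional `initial_labels` argument (used to make a run reproducible), and the `max_iter` safety cap are assumptions added here. Empty clusters are simply dropped, which is how the sketch ends up with "k (or fewer)" clusters.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, initial_labels=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: assign every point to a cluster at random (unless labels are supplied).
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each non-empty cluster.
        groups = {}
        for point, label in zip(points, labels):
            groups.setdefault(label, []).append(point)
        new_centroids = [centroid(groups[label]) for label in sorted(groups)]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
    return labels, centroids
```

Calling `kmeans(points, k=3)` on a list of coordinate tuples returns one cluster index per point together with the final centroids. The result depends on the random initial assignment, so different seeds can give different clusterings.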
K-means: Example, k = 3
Step 1: Make random assignments and compute centroids (big dots)
Step 2: Assign points to nearest centroids
Step 3: Re-compute centroids (in this example, solution is now stable)
Fuzzy K means
The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.
The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i:
Fuzzy K means Algorithm
Make initial guesses for the means m_1, m_2, ..., m_k
Until there are no changes in any mean:
    Use the estimated means to find the degree of membership u(j,i) of x_j in Cluster i; for example, if dist(j,i) = exp(-||x_j - m_i||^2), one might use u(j,i) = dist(j,i) / Σ_j dist(j,i)
    For i from 1 to k
        Replace m_i with the fuzzy mean of all of the examples for Cluster i:
        m_i = Σ_j u(j,i)^2 x_j / Σ_j u(j,i)^2
    end_for
end_until
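A minimal Python sketch of the fuzzy k-means loop follows. It uses the exponential similarity from the pseudocode and squared memberships as weights in the fuzzy mean. Several details are assumptions of this sketch rather than part of the slides: memberships are normalized across the k clusters so that each point's memberships sum to 1, "no changes in any mean" is approximated by a tolerance `tol`, and the function names are made up for illustration.

```python
import math


def memberships(points, means):
    """Return u[j][i], the degree of membership of point j in cluster i."""
    u = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
        nearest = min(d2)
        # exp(-(d2 - nearest)) keeps the same ratios as exp(-d2)
        # but cannot underflow to zero for every cluster at once.
        sim = [math.exp(-(d - nearest)) for d in d2]
        total = sum(sim)
        u.append([s / total for s in sim])  # assumption: normalize over clusters
    return u


def fuzzy_kmeans(points, initial_means, tol=1e-9, max_iter=200):
    means = [tuple(m) for m in initial_means]
    for _ in range(max_iter):
        u = memberships(points, means)
        new_means = []
        for i in range(len(means)):
            # Fuzzy mean: weight each point by its squared membership in cluster i.
            weights = [row[i] ** 2 for row in u]
            total = sum(weights)  # assumed positive; a far-away mean could make it 0
            new_means.append(tuple(
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in range(len(means[i]))
            ))
        shift = max(abs(a - b) for old, new in zip(means, new_means)
                    for a, b in zip(old, new))
        means = new_means
        if shift < tol:  # "no changes in any mean", up to a tolerance
            break
    return means, memberships(points, means)
```

Unlike the hard assignment in k-means, every point contributes to every mean here, only with a weight that shrinks quickly with distance.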
Time course experiment
K-means: Sample Application
Gene clustering. Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.
Normalization allows comparisons across microarrays.
Produce clusters of genes which vary in similar ways over time.
Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
Sample Array. Rows are genes and columns are time points.
A cluster of co-regulated genes.
Iteration = 3
• Start with random positions of K centroids.
• Iterate until centroids are stable:
• Assign points to centroids
• Move centroids to the center of the assigned points
Centroid Methods - K-means
Application of K-means to time course experiments
Agglomerative Hierarchical Clustering
Results depend on distance update method
Single linkage: elongated clusters
Complete linkage: sphere-like clusters
Greedy iterative process
Not robust against noise
No inherent measure to choose the clusters
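The effect of the distance update method can be seen in a small sketch. The code below is a naive, illustrative implementation of greedy agglomerative clustering with Euclidean distance; the function name, the `n_clusters` stopping rule, and the toy data in the usage note are assumptions added here. Because the method has no inherent measure for choosing the clusters, the number of clusters must be supplied by the caller.

```python
import math


def agglomerative(points, n_clusters, linkage="single"):
    """Greedy bottom-up clustering; returns clusters as lists of point indices."""
    if linkage == "single":
        combine = min   # distance between the closest pair of members
    elif linkage == "complete":
        combine = max   # distance between the farthest pair of members
    else:
        raise ValueError("linkage must be 'single' or 'complete'")

    def cluster_distance(a, b):
        return combine(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Greedy step: merge the two clusters that are currently closest.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

On the one-dimensional points 0, 1.0, 2.1, 3.3, 4.6, 6.0 (slowly growing gaps) with `n_clusters=2`, single linkage chains the first five points into one elongated cluster and leaves the last point alone, while complete linkage splits the data into the first four and the last two points.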
Gene Expression Data
Cluster genes and conditions
Two independent clusterings:
Genes are represented as vectors of expression in all conditions
Conditions are represented as vectors of expression of all genes
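The two representations are just the rows and the columns of the same expression matrix. The sketch below assumes a tiny made-up matrix (rows are genes, columns are conditions), standardizes each gene to mean 0 and standard deviation 1 as one possible normalization, and transposes the result to obtain the condition vectors. The helper names and the data are hypothetical.

```python
import math


def standardize(vector):
    """Rescale a vector to mean 0 and standard deviation 1 (fails if it is constant)."""
    mean = sum(vector) / len(vector)
    sd = math.sqrt(sum((x - mean) ** 2 for x in vector) / len(vector))
    return [(x - mean) / sd for x in vector]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def pearson(a, b):
    """Pearson correlation of two equally long vectors."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


# Hypothetical expression matrix: 3 genes (rows) x 4 conditions (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 8.0],  # gene 1: same profile shape as gene 0, larger scale
    [4.0, 3.0, 2.0, 1.0],  # gene 2: opposite trend
]

gene_vectors = [standardize(row) for row in expression]  # one vector per gene
condition_vectors = transpose(gene_vectors)              # one vector per condition
```

Either list of vectors can then be passed to a clustering routine. After standardization, genes 0 and 1 have identical vectors even though their raw scales differ, which is the sense in which genes with correlated profiles end up close together.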
[Figure: Colon cancer data (normalized genes). Horizontal axis: Experiments; vertical axis: Genes.]
1. Identify tissue classes (tumor/normal)
First clustering - Experiments
2. Find Differentiating And Correlated Genes
Second Clustering - Genes
[Figure labels, gene clusters: Ribosomal proteins, Cytochrome C, HLA2, metabolism]
Two-way Clustering
Coupled Two-way Clustering (CTWC)
Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.
New Goal: Use subsets of genes to study subsets of samples (and vice versa).
A non-trivial task: exponential number of subsets.
CTWC is a heuristic to solve this problem.
[Figure: CTWC of Colon Cancer Data, panels (A) and (B)]
Multiple Testing Problem
Simultaneously test m null hypotheses, one for each gene j
Hj: no association between expression measure of gene j and the response
Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue
Increased chance of false positives
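A rough sense of the size of the problem can be had from a short calculation. Under the simplifying assumptions (added here, not stated on the slide) that all m null hypotheses are true and the tests are independent, the probability of at least one false positive when each test is run at level alpha is 1 - (1 - alpha)^m.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """P(at least one rejection) for m independent tests of true nulls at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 100, 1000):
    print(m, round(prob_at_least_one_false_positive(m), 4))
```

Real expression data are correlated across genes, so this number is only indicative, but it shows why unadjusted per-gene testing at a fixed level is not adequate for thousands of genes.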
Hypothesis Truth Vs. Decision
Truth \ Decision    # not rejected       # rejected          totals
# true H            U                    V (Type I error)    m0
# non-true H        T (Type II error)    S                   m1
totals              m - R                R                   m
Strong Vs. Weak Control
All probabilities are conditional on which hypotheses are true
Strong control refers to control of the Type I error rate under any combination of true and false nulls
Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)
In general, weak control without other safeguards is unsatisfactory
Adjusted p-values (p*)
Test level (e.g. 0.05) does not need to be determined in advance
Some procedures most easily described in terms of their adjusted p-values
Usually easily estimated using resampling
Procedures can be readily compared based on the corresponding adjusted p-values
A Little Notation
For hypothesis Hj, j = 1, …, m
observed test statistic: tj
observed unadjusted p-value: pj
Ordering of observed (absolute) t_j: {r_j} such that |t_r1| ≥ |t_r2| ≥ … ≥ |t_rm|
Ordering of observed p_j: {r_j} such that p_r1 ≤ p_r2 ≤ … ≤ p_rm
Denote corresponding RVs by upper case letters (T, P)
Control of the type I errors
Bonferroni single-step adjusted p-values
p*_j = min(m p_j, 1)
Sidak single-step (SS) adjusted p-values
p*_j = 1 - (1 - p_j)^m
Sidak free step-down (SD) adjusted p-values
p*_(j) = 1 - (1 - p_(j))^(m - j + 1)
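The three adjustments can be sketched in a few lines of Python. Each function takes the unadjusted p-values in their original order and returns the adjusted values in that same order. The function names are illustrative. In the step-down version, p_(j) is read as the j-th smallest p-value, and the sketch additionally takes a running maximum so that the adjusted values stay monotone in the ordering; that step is an assumption of the sketch and is not shown in the formula above.

```python
def bonferroni(p):
    """Single-step Bonferroni: min(m * p_j, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]


def sidak_single_step(p):
    """Single-step Sidak: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1 - (1 - pj) ** m for pj in p]


def sidak_step_down(p):
    """Step-down Sidak: 1 - (1 - p_(j))^(m - j + 1), made monotone by a running max."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running = 0.0
    for rank, j in enumerate(order):              # rank 0 is the smallest p-value
        value = 1 - (1 - p[j]) ** (m - rank)      # exponent m - j + 1 with j = rank + 1
        running = max(running, value)             # assumption: enforce monotonicity
        adjusted[j] = running
    return adjusted
```

For p = [0.01, 0.04, 0.03], Bonferroni gives about [0.03, 0.12, 0.09], while the step-down version gives about [0.0297, 0.0591, 0.0591]: the adjustment shrinks as hypotheses are stepped through, so it is never more conservative than the single-step one.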
Control of the type I errors
Holm (1979) step-down adjusted p-values
p*_rj = max_{k = 1…j} { min((m - k + 1) p_rk, 1) }
Intuitive explanation: once H(1) is rejected by Bonferroni, there are only m - 1 remaining hypotheses that might still be true (then another Bonferroni, etc.)
Hochberg (1988) step-up adjusted p-values (Simes inequality)
p*_rj = min_{k = j…m} { min((m - k + 1) p_rk, 1) }
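Both formulas apply the same multiplier (m - k + 1) to the k-th smallest p-value and differ only in the direction of the running extremum: Holm takes a running maximum from the smallest p-value upward, Hochberg a running minimum from the largest downward. A minimal sketch, with illustrative function names, returning adjusted values in the original order of the input:

```python
def holm(p):
    """Holm step-down: running max of min((m - k + 1) * p_rk, 1) over k = 1..j."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # r_1, ..., r_m
    adjusted = [0.0] * m
    running = 0.0
    for k, j in enumerate(order):                 # k is 0-based, so multiplier is m - k
        running = max(running, min((m - k) * p[j], 1.0))
        adjusted[j] = running
    return adjusted


def hochberg(p):
    """Hochberg step-up: running min of min((m - k + 1) * p_rk, 1) over k = j..m."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running = 1.0
    for k in reversed(range(m)):                  # start from the largest p-value
        j = order[k]
        running = min(running, min((m - k) * p[j], 1.0))
        adjusted[j] = running
    return adjusted
```

For p = [0.01, 0.04, 0.03], Holm gives about [0.03, 0.06, 0.06] and Hochberg about [0.03, 0.04, 0.04]. Hochberg's adjusted values are never larger than Holm's, at the price of relying on the Simes inequality.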
Control of the type I errors
Westfall & Young (1993) step-down minP adjusted p-values
p*_rj = max_{k = 1…j} { Pr( min_{l ∈ {rk…rm}} P_l ≤ p_rk | H0^C ) }
Westfall & Young (1993) step-down maxT adjusted p-values
p*_rj = max_{k = 1…j} { Pr( max_{l ∈ {rk…rm}} |T_l| ≥ |t_rk| | H0^C ) }
Westfall & Young (1993) Adjusted p-values
Step-down procedures: successively smaller adjustments at each step
Take into account the joint distribution of the test statistics
Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values
Can be estimated by resampling but computer-intensive (especially for minP)
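A resampling estimate of the step-down maxT adjusted p-values can be sketched as follows. To stay short and deterministic, the sketch makes several assumptions that are not from the slides: there are two groups of samples, the test statistic is the absolute difference in group means rather than a t-statistic, and the complete null distribution is obtained by enumerating every relabeling of the samples, which is only feasible for small sample sizes (random permutations would be used otherwise). Function names are illustrative.

```python
from itertools import combinations


def abs_mean_difference(row, group1):
    """|mean of samples in group1 - mean of the remaining samples| for one gene."""
    inside = [x for i, x in enumerate(row) if i in group1]
    outside = [x for i, x in enumerate(row) if i not in group1]
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))


def max_t_step_down(data, group1):
    """data: one row per gene, one column per sample; group1: sample indices."""
    group1 = frozenset(group1)
    m, n = len(data), len(data[0])
    observed = [abs_mean_difference(row, group1) for row in data]
    # r_1, ..., r_m: genes ordered from largest to smallest observed statistic.
    order = sorted(range(m), key=lambda j: -observed[j])
    relabelings = [frozenset(c) for c in combinations(range(n), len(group1))]
    counts = [0] * m
    for relabeled in relabelings:
        stats = [abs_mean_difference(row, relabeled) for row in data]
        running_max = 0.0
        # Successive maxima over {r_k, ..., r_m}, built from the least significant gene up.
        for k in reversed(range(m)):
            running_max = max(running_max, stats[order[k]])
            if running_max >= observed[order[k]] - 1e-12:  # tolerance for float ties
                counts[k] += 1
    adjusted = [0.0] * m
    running = 0.0
    for k in range(m):
        # Outer max over k = 1..j keeps the adjusted p-values monotone.
        running = max(running, counts[k] / len(relabelings))
        adjusted[order[k]] = running
    return adjusted
```

Because each gene's statistic is compared with the maximum over the remaining genes under the same relabeling, the dependence between genes is carried into the adjustment. With 3 versus 3 samples there are only 20 relabelings, so the smallest attainable adjusted p-value is 2/20 = 0.1, which illustrates why larger sample sizes or more resamples are needed in practice.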