Colon cancer subtypes from gene expression data
Nathan Cunningham, Giuseppe Di Benedetto, Sherman Ip, Leon Law
Module 6: Applied Statistics
26th February 2016
Aim
- Replicate findings of Felipe De Sousa et al. (2013)
- Cluster analysis to identify subtypes of colon cancer
- Construct a classifier to identify clusters
- Identify a suitable subset of the data to perform these analyses
- Consider robustness of findings to changes in methods and perturbations in the data
Data
- GSE33113 data set (Academic Medical Centre, Amsterdam)
- Patients with stage II colon cancer
- 90 patients, 54,675 gene expressions recorded
Data processing
- Normalisation to remove batch effects
- Gene expression presence detected using the barcode algorithm; genes not present in at least one sample removed
- Genes with a median absolute deviation > 0.5 retained and median centred
- Felipe De Sousa et al. (2013) find 7,846 genes remain; we find anywhere from none to all of the genes remain
- Use the 146 genes identified by Felipe De Sousa et al. (2013) in analyses
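
The median absolute deviation filter and median centring step can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis; the function names and the toy values are hypothetical, and the MAD here is unscaled (whether the original analysis applied a scaling constant is not stated).

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation of a list of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])

def filter_and_centre(expr, cutoff=0.5):
    """Keep genes whose MAD exceeds the cutoff, then subtract each gene's median.

    expr maps a gene name to its expression values across samples.
    """
    kept = {}
    for gene, values in expr.items():
        if mad(values) > cutoff:
            m = median(values)
            kept[gene] = [v - m for v in values]
    return kept

# Toy data (hypothetical values, not taken from GSE33113).
toy = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # variable across samples
    "gene_b": [2.0, 2.1, 2.0, 1.9, 2.0],  # nearly constant across samples
}
filtered = filter_and_centre(toy)
```

Only `gene_a` passes the cutoff here, and its values are shifted so that their median is zero.
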
Cluster Analysis
- Hierarchical: agglomerative, average linkage
- K-Means
- Consensus
- Model-based (Fraley & Raftery, 2002)
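
As an illustration of the first method in the list, agglomerative clustering with average linkage can be sketched in a few lines. This is a naive Python sketch under assumed Euclidean distances, not the implementation used for the slides; the function name and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist

def average_linkage(points, n_clusters):
    """Merge the two clusters with the smallest mean pairwise Euclidean
    distance until n_clusters remain. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average of all distances between a member of a and a member of b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]

# Toy points (hypothetical): two tight pairs and one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
hier_clusters = average_linkage(toy_points, 3)
```

Each merge recomputes all linkages, which is fine for a toy example but far too slow for real expression data.
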
How many clusters?
Clustering methods comparison
- Homogeneity: reflects compactness of the clusters
- Separation: reflects the distance between clusters
- Silhouette: $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
- WADP (weighted average discrepant pairs)
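
The silhouette formula can be made concrete with a short sketch, taking a(i) as the mean distance from sample i to the other members of its own cluster and b(i) as the smallest mean distance to the members of any other cluster. Euclidean distance, the function name, and the toy points are assumptions for illustration; singleton clusters are given a score of 0 by convention.

```python
from math import dist

def silhouette(points, labels):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point i."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from point i to each cluster, excluding i itself.
        mean_dist = {}
        for label in set(labels):
            others = [dist(p, q) for j, q in enumerate(points)
                      if labels[j] == label and j != i]
            if others:
                mean_dist[label] = sum(others) / len(others)
        a = mean_dist.pop(labels[i], None)
        if a is None or not mean_dist:
            scores.append(0.0)  # singleton cluster, or only one cluster
            continue
        b = min(mean_dist.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Toy points (hypothetical): two well-separated pairs.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette(pts, [0, 1, 0, 1])   # labels cut across the geometry
```

Values near 1 indicate a sample sitting well inside its cluster; negative values indicate a sample that is closer, on average, to another cluster.
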
Robustness under perturbation
Figure: value (y-axis, 0.00 to 0.20) against sd (x-axis, 0.0 to 2.0) for cons_kmeans, mclust and cons_hierclust.
Cluster methods comparison
| Cluster comparison | WADP value |
|---|---|
| C-k-means vs C-hierarchical | 0.070 |
| MClust vs C-hierarchical | 0.015 |
| C-k-means vs MClust | 0.081 |
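
One common way to define WADP, assumed here since the slides do not spell it out, is: for each cluster of a reference clustering, take the fraction of its sample pairs that are separated in the second clustering, then average these fractions weighted by cluster size. The Python sketch below follows that assumed definition; the function name and label vectors are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def wadp(reference, other):
    """Weighted average of the fraction of within-cluster pairs in
    `reference` that end up in different clusters of `other`."""
    members = defaultdict(list)
    for idx, label in enumerate(reference):
        members[label].append(idx)

    weighted_sum = 0.0
    total_weight = 0
    for idxs in members.values():
        pairs = list(combinations(idxs, 2))
        if not pairs:
            continue  # singleton clusters contain no pairs
        discrepant = sum(1 for i, j in pairs if other[i] != other[j])
        weighted_sum += len(idxs) * discrepant / len(pairs)
        total_weight += len(idxs)
    return weighted_sum / total_weight if total_weight else 0.0

# Toy label vectors (hypothetical): one sample moves between clusters.
ref_labels = [0, 0, 0, 1, 1, 1]
new_labels = [0, 0, 1, 1, 1, 1]
toy_wadp = wadp(ref_labels, new_labels)
```

The measure depends only on which samples are grouped together, so relabelling the clusters leaves it unchanged. A value of 0 means every reference pair stays together.
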
Classification: PAM
- R package implementing nearest shrunken centroid classification.
- Gives higher weights to genes whose class centroid is far away from the overall centroid of the genes.
- A threshold parameter specifies the shrinkage applied to the weights, giving higher weights to genes which are stable within the class.
- Can eliminate genes with weaker effects, allowing automatic feature selection.
- Classifies a sample to the class with the smallest distance to its shrunken centroid.
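
The shrinkage idea in the list above can be sketched as follows. This is a deliberately simplified Python illustration, not the R package: it standardises each gene by its pooled within-class standard deviation only, omitting the sample-size factor, offset and class priors of the full method. All names and toy values are hypothetical.

```python
from math import copysign, sqrt

def fit_shrunken_centroids(X, y, threshold):
    """X: list of samples (lists of gene values); y: list of class labels."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    centroids = {
        k: [sum(row[j] for row, lab in zip(X, y) if lab == k) / y.count(k)
            for j in range(p)]
        for k in classes
    }
    # Pooled within-class standard deviation per gene (1.0 if it is zero).
    sd = []
    for j in range(p):
        ss = sum((row[j] - centroids[lab][j]) ** 2 for row, lab in zip(X, y))
        sd.append(sqrt(ss / (n - len(classes))) or 1.0)
    # Standardised centroid offsets, soft-thresholded towards zero.
    shrunk = {
        k: [copysign(
                max(abs((centroids[k][j] - overall[j]) / sd[j]) - threshold, 0.0),
                centroids[k][j] - overall[j])
            for j in range(p)]
        for k in classes
    }
    return {"overall": overall, "sd": sd, "d": shrunk}

def predict(model, x):
    """Assign x to the class with the nearest shrunken centroid."""
    def score(k):
        return sum(
            ((x[j] - (model["overall"][j] + model["sd"][j] * model["d"][k][j]))
             / model["sd"][j]) ** 2
            for j in range(len(x)))
    return min(model["d"], key=score)

def surviving_genes(model):
    """Indices of genes with a non-zero shrunken offset in at least one class."""
    p = len(model["overall"])
    return [j for j in range(p) if any(model["d"][k][j] != 0 for k in model["d"])]

# Toy data (hypothetical): gene 0 separates the classes, gene 1 does not.
X_toy = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8], [4.0, 1.1], [4.2, 0.9], [3.8, 1.0]]
y_toy = ["A", "A", "A", "B", "B", "B"]
model = fit_shrunken_centroids(X_toy, y_toy, threshold=2.0)
```

With this threshold the uninformative gene is shrunk to the overall centroid in both classes and so drops out of the classifier, which is the feature selection effect described above.
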
Classification: Multi-Class SVM
- The R package e1071 was used to perform the multi-class SVM with an RBF kernel.
- Uses a one vs one approach (i.e. 3 binary classifiers) with class prediction done by a voting scheme.
- If a linear kernel were used instead, feature selection could be performed by ranking the features using their weights.
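
The one vs one scheme trains one binary classifier per pair of classes, so K classes need K(K-1)/2 classifiers, which is 3 for the 3 clusters here. A minimal Python sketch of the voting step follows; the binary classifiers are hypothetical threshold rules standing in for trained SVMs, and ties are broken by whichever class received its first vote earliest.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, binary_classifiers, x):
    """Each pairwise classifier votes for one of its two classes;
    the class with the most votes is predicted."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]

classes = [1, 2, 3]
# Hypothetical stand-ins for trained binary SVMs on a one-dimensional input.
binary_classifiers = {
    (1, 2): lambda x: 1 if x < 1.0 else 2,
    (1, 3): lambda x: 1 if x < 2.0 else 3,
    (2, 3): lambda x: 2 if x < 3.0 else 3,
}
n_classifiers = len(list(combinations(classes, 2)))
predictions = [one_vs_one_predict(classes, binary_classifiers, x)
               for x in (0.5, 2.5, 4.0)]
```
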
Classification: Random Forest
- The R package randomForest was used to train a random forest.
- A total of 300 trees were built, with 12 variables randomly chosen as candidates at each split.
- Feature selection can be done using mean decrease in accuracy, which uses permutation of the features and the out-of-bag error.
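
The permutation idea behind mean decrease in accuracy can be sketched without a forest: shuffle one feature at a time and record how much the accuracy falls. The Python sketch below applies this to a fixed, hypothetical classifier on a held-out toy set, whereas a random forest would do it per tree on that tree's out-of-bag samples. All names are assumptions for illustration.

```python
import random

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == lab for r, lab in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j permuted.
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical classifier that uses only feature 0.
def toy_classifier(row):
    return "A" if row[0] < 2.0 else "B"

X_eval = [[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8], [4.0, 1.1], [4.2, 0.9], [3.8, 1.0]]
y_eval = ["A", "A", "A", "B", "B", "B"]
importance = permutation_importance(toy_classifier, X_eval, y_eval)
```

A feature the classifier ignores has an importance of exactly zero, while shuffling an informative feature lowers the accuracy.
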
Results: PAM
Figure: 10-fold cross-validation misclassification error (0.0 to 0.8) against value of threshold (0 to 10), in two panels; the second panel has the legend Label 1, Label 2, Label 3. The number of genes along the top axis falls from 146 to 4 (146, 146, 145, 127, 110, 89, 70, 57, 46, 35, 28, 19, 13, 7, 7, 6, 4). The optimal threshold was estimated to be 6.2 ± 0.2.
Results: SVM, Random Forest and PAM
| Method | 10-fold cross-validation error |
|---|---|
| SVM (C = 1, γ = 0.0068) | 1.1% |
| PAM (threshold = 6.2) | 2.2% |
| Random Forest | 3.3% |

Table: 10-fold cross-validation average error on the trained classifiers
Error bars can be estimated using bootstrapping.
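
With 90 patients, each misclassified patient contributes 1/90 ≈ 1.1% to the error, so the tabulated values are consistent with 1, 2 and 3 misclassified patients. The Python sketch below shows one simple way to attach a percentile interval to such an error rate: resample per-patient error indicators with replacement. This is a simplified stand-in, since the slides do not specify the exact bootstrap scheme; the function name and the error vector are hypothetical.

```python
import random
from statistics import median

def bootstrap_error_interval(errors, n_boot=500, level=0.95, seed=0):
    """Percentile interval and median of the error rate, resampling the
    per-sample 0/1 error indicators with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(errors)
    rates = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lower = rates[int((1 - level) / 2 * (n_boot - 1))]
    upper = rates[int((1 + level) / 2 * (n_boot - 1))]
    return lower, median(rates), upper

# Hypothetical indicators: 2 of 90 patients misclassified.
errors = [1, 1] + [0] * 88
lower, mid, upper = bootstrap_error_interval(errors)
```

Because the error rate can only take values that are multiples of 1/90, the interval endpoints move in steps of about 1.1 percentage points.
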
Results: PAM Bootstrapping
Figure: Validation error (%) against threshold (unknown units). Median (point) and 95% percentile (error bar) of the 10-fold cross-validation error, bootstrapping 500 times.
Results: PAM Bootstrapping
Figure: Number of genes which survived thresholding against threshold (unknown units). Mean (point) and standard deviation (error bar) of the number of genes which survived thresholding, bootstrapping 500 times.
Results: PAM Bootstrapping
| Method | 10-fold cross-validation error |
|---|---|
| PAM (threshold = 0.0) | $\left(1.1^{+2.2}_{-1.1}\right)$% |
| PAM (threshold = 6.2) | $\left(2.2^{+3.3}_{-2.2}\right)$% |
| SVM | $\left(1.1^{+1.1}_{-0.0}\right)$% |
| Random Forest | $\left(1.1^{+2.2}_{-1.1}\right)$% |

Table: Median and 95% percentile of the 10-fold cross-validation error, bootstrapping 500 times.
For PAM with threshold 6.2, (36.5 ± 6.3) genes survived thresholding.
Conclusion
- Clustering methods were robust
- PAM performed similarly to the other methods
- More thresholds to be investigated
- Scale to larger datasets
References
Felipe De Sousa, E. M., Wang, X., Jansen, M., Fessler, E., Trinh, A., de Rooij, L. P., . . . others (2013). Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nature Medicine, 19(5), 614–618.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.