CS229, Fall 2015 · cs229.stanford.edu/proj2015/285_poster.pdf · 2017. 9. 23. · Hunter Boyce, Alice...

Prediction of clonal subtypes in breast invasive carcinoma
Hunter Boyce, Alice Yu, Anna Shcherbina
CS229, Fall 2015

Goals and Data Availability

Breast cancer is the most commonly diagnosed cancer and the second leading cause of cancer death among women. This highly heterogeneous disease has many clonal subtypes, and patients respond differently to treatment because of genetic differences in both the disease and the patient. To better understand the disease, The Cancer Genome Atlas (TCGA), a rich database of genetic and electronic medical information, can be mined to identify distinctive features for each clonal subtype and for patient survivability. Using the significant genes found, unsupervised and supervised methods can be applied to find genetic differences within clonal subgroups and to predict clinical outcomes from these features.

[Figure: Patient Distribution in Breast Cancer TCGA Clinical Features. The number of patients in each clinical categorical feature; the colored bars represent data used in further analysis.]

Feature Selection

Input: RNA-Seq gene expression levels for 20,531 genes across 800 subjects

Pre-processing:
1. Removed 475 genes with low expression (< 1 FPKM).
2. Removed 699 genes with < 0.1 FPKM^2 expression variance.

Forward subset selection
1. Linear or logistic regression of each feature against numerical or categorical outcomes, respectively, was performed.
2. Features with FDR > 0.05 for association with the outcome were excluded.
3. regsubsets (R leaps package) was used to perform greedy forward subset selection on the remaining features.
4. Feature subset size was selected to minimize the BIC.

Elastic net regularization and group lasso
glmnet package in R
Group lasso (alpha=1)
Elastic net regularization (alpha=0.5)

Using 10-fold cross-validation and selecting feature subset size to minimize BIC.

Result:
  Survivability:          277 genes
  Age of Diagnosis:       194 genes
  Cancer stage:           140 genes
  Estrogen Receptor:      176 genes
  Progesterone Receptor:  175 genes
  HER2 Receptor:           52 genes

[Figure: Venn diagram of the genes selected by Forward Selection, Elastic Net Regularization, and Group Lasso. Region counts: 391, 149, 156, 72, 31, 204, 7]

[Figure gene labels (panels: Tumor Stage, Survivability; axis: genes with the strongest effect size):
  TRIM72, ZFP42, LACRT, TMIGD1, ACSM1, CCNE1, SCUBE2, PP1R1C, IL19, CPB2
  ZNF705A, LCE3E, PZP, PRDM14, ZPBP2, ZIC4, LOC93432, LCT, GSDMC, B3GNT5, SOX11, SEL1L3, HAPLN3, MALL, CLDN1, CDH3, MMP1, S100A1, PADI2]

Unsupervised Clustering of Patient Gene Profiles

PCA and K-Means++ were used to create the clusters. The distinct groups suggest additional patient subgroups that could help make treatment more precise.

[Figure panels: Hierarchical Clustering; K-Means Clustering. Panel annotations: P-Value = 1.26e-31; P-Value = 0.89; P-Value = 1.38e-07; P-Value = 1.66e-26]

ANOVA p-value for Survivability = 1.80e-4

Tukey HSD Difference of Means (Days)
  Expected   +346   -346
  Lower      +304   -304
  Upper      +386   -386

Chi-Squared p-value for Estrogen Receptor = 4.95e-43

Chi-Squared Standardized Residuals
  Outcome
  0          -1.0    2.4
  1           1.0   -2.5

             -7.8   16.3   -1.5   -4.2   -2.2   -3.6   -2.8
              7.9  -16.1    1.3    4.2    2.3    3.6    2.8

Supervised Classification and Regression

Classification: Receiver Operating Characteristic curves for the prediction of two subtypes of cancer. The area under the curve is given with the classifier in the legend.

Regression: Statistical metrics of regression performance on RNA expression for predicting tumor stage and survival time.

Survivability (Range: 158-4456 days)
  Regressor            RMSE      Median Absolute Error   R^2
  SVR Linear Kernel     764.29   497.39                   0.60
  SVR Sigmoid Kernel   1212.04   909.00                   0.00
  Linear Regression     479.35   381.36                   0.84

Tumor Stage (Range: 1-10)
  Regressor            RMSE      Median Absolute Error   R^2
  SVR Linear Kernel    2.31      1.36                     0.10
  SVR Sigmoid Kernel   2.54      0.90                    -0.08
  Linear Regression    2.05      1.35                     0.30

Biological Significance: Feature Selection Highlights Gene Ontology Networks and KEGG Pathways Implicated in Cancer

Outcome                Pathway                                                     FDR      Genes in Feature Set
Estrogen Receptor      TGF-Beta-dependent induction of EMT via MAPK                5.75e-2  TGF-Beta, TGF-Beta 3, PTEN, Claudin-1
Progesterone Receptor  Slit-Robo signaling                                         7.96e-2  SLIT1, SLIT3
Progesterone Receptor  Cadherin-mediated cell adhesion                             7.96e-2  M-cadherin, P-cadherin
HER2 Receptor          ERBB-family signaling                                       5.01e-2  ERBB2, GRB7
HER2 Receptor          Role of alpha-6/beta-4 integrins in carcinoma progression   5.01e-2  ERBB2, ITGB4, 14-3-3 tau
Survivability          Cell adhesion and ECM remodeling                            5.51e-2  Kallikrein 3, Kallikrein 2, Kallikrein 1
Survivability          Bradykinin/kallidin maturation                              4.88e-2  CPB2, Tissue Kallikrein, Plasma Kallikrein
Survivability          Role of Parkin in the ubiquitin-proteasomal pathway         8.75e-2  UBE1, Cyclin E, HSP70, Tubulin Beta

[Figure: pathway maps for "Role of alpha-6/beta-4 integrins in carcinoma progression" and "Cell adhesion and ECM remodeling". Red bars indicate that the pathway gene is present in the feature set. Labeled nodes: p53, Bcl-6]

ER gene set overlaps with the ESR1 (nuclear), MALL, Hyaluronic acid, NCOA3, Estradiol network (54/61 genes present in feature set).
GO processes:
  Reproductive structure development (20%, p=5.72e-7)
  Female sex differentiation (10.9%, p=8.24e-6)
  Growth hormone receptor signaling pathway (5.2%, p=3.43e-18)

HER2 gene set overlaps with the Survivin, Bcl-6, AKT (PKB), Fra-1, Bim network (114/180 genes present in feature set).
GO processes:
  Regulation of cell death (43%, p=2.4e-29)
  Regulation of apoptotic process (41.4%, p=9.82e-29)
  Regulation of programmed cell death (41.4%, p=1.65e-28)
  Apoptotic signaling pathway (22.5%, p=4.78e-27)

Survival Analysis

Kaplan-Meier plots for the survival of sub-populations of patients. The results show the opposite of what is expected and should be explored further.
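
The pre-processing thresholds and the FDR cutoff listed under Feature Selection can be illustrated with a short sketch. The poster performed these steps in R; the Python below is only an illustration, and several details are assumptions not stated on the poster: that "low expression" refers to the mean FPKM across subjects, that the variance is the sample variance, and that the FDR is computed with the Benjamini-Hochberg procedure. All function names are hypothetical.

```python
from statistics import mean, variance


def filter_genes(expr, min_mean_fpkm=1.0, min_var=0.1):
    """Keep genes whose mean FPKM and sample variance both clear the cutoffs.

    `expr` maps a gene name to its FPKM values across subjects. Using the
    mean as the "low expression" statistic is an assumption of this sketch.
    """
    return {
        gene: values
        for gene, values in expr.items()
        if mean(values) >= min_mean_fpkm and variance(values) >= min_var
    }


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_by_fdr(gene_pvalues, max_fdr=0.05):
    """Drop genes whose adjusted p-value exceeds `max_fdr`."""
    genes = list(gene_pvalues)
    adjusted = benjamini_hochberg([gene_pvalues[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q <= max_fdr]
```

In this sketch the per-gene p-values passed to `select_by_fdr` would come from the univariate linear or logistic regressions; fitting those models is outside the scope of the example.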

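The regression tables report RMSE, median absolute error, and R^2. As a reading aid, the sketch below computes the three metrics from paired true and predicted values using their standard definitions; it is an assumption that the poster used these same definitions, and the function name is hypothetical. With R^2 defined as one minus the ratio of residual to total sum of squares, a model that fits worse than always predicting the mean gets a negative score, which is one way a value such as -0.08 can arise.

```python
from math import sqrt
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Return (RMSE, median absolute error, R^2) for paired observations.

    Assumes y_true is not constant, otherwise the total sum of squares
    is zero and R^2 is undefined.
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    rmse = sqrt(mean([r * r for r in residuals]))
    median_abs_err = median([abs(r) for r in residuals])
    y_bar = mean(y_true)
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, median_abs_err, r2
```

Because the median absolute error ignores the largest residuals while RMSE is dominated by them, the two metrics can rank models differently.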