  • Analysis of Globally-Coherent Data Sets

    a two-day course for GlaxoSmithKline

    Yinyin Yuan and Mauro Castro ∗

    http://www.markowetzlab.org/GCDcourse/

    [email protected]

    Stevenage, 28 - 29th Oct 2010

    Abstract

    This tutorial refers to the practical session on day one of the course Analysis of Globally-Coherent Data Sets. The topic is Molecular Data Integration with R.

    Part I

    1 Introduction

    Globally-coherent datasets (GCDs) contain (at least) three levels of information: (i) genome-wide DNA variation, (ii) an intermediate trait, and (iii) a (clinical) phenotype. Intermediate traits are typically gene expression, but may also include proteomic, metabolomic, and other molecular data. These data sets make it possible to dissect how a genomic perturbation (e.g. a somatic copy-number alteration) leads to changes in cellular networks and pathways, which then shape the phenotype (e.g. how aggressive a type of cancer is). Examples of GCDs are The Cancer Genome Atlas, the International Cancer Genome Consortium, the METABRIC project at the CRI in Cambridge, as well as the data collected by SAGE Bionetworks.

    The challenge of GCDs is to gain a global understanding of how the different layers of information are connected. While effective statistical methods can provide a system-level view of the genomic landscape, network visualization methods are key to visualizing complex data sets. Together these tools can help to ‘boil down’ the complex multi-layered GCDs into testable hypotheses for in-depth follow-up studies.

    Here we exemplify this type of data analysis with a breast cancer dataset comprising CNA, mRNA and patient information [1].

    2 Main setup

    Load the packages ‘Dance’ and ‘lol’ and the datasets that will be used in Part I of this tutorial. To do this, first set the working directory to the folder called ‘chin07’ (e.g. setwd('./chin07/')). Please check this folder: it must contain two other folders called ‘dataDir’ and ‘resDir’. The first one contains all the data you will need and the second is just for results. Also, the parent of this directory must contain the folder called ‘Package’ (referred to as ‘../Package/’ in the install commands), which holds the main tools to run this tutorial. The last page lists the complete version information about R, including loaded packages and attachments. Then just follow the commented workflow below.
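The folder check described above can be scripted. The following is a minimal sketch in base R, not part of the course material: the helper name check_course_dirs and the throw-away demo directory are assumptions made for illustration only.

```r
# Hypothetical helper: report which of the expected sub-folders are present.
check_course_dirs <- function(root, expected = c("dataDir", "resDir")) {
  present <- dir.exists(file.path(root, expected))
  names(present) <- expected
  present
}

# Demonstration on a throw-away directory that mimics the 'chin07' layout.
demo_root <- file.path(tempdir(), "chin07_demo")
dir.create(file.path(demo_root, "dataDir"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "resDir"), showWarnings = FALSE)
layout_ok <- check_course_dirs(demo_root)
print(layout_ok)
```

In a real session one would call check_course_dirs(getwd()) after setwd('./chin07/') and stop if any entry is FALSE.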

    Data input

    Install the main packages and set the data folders.

    > install.packages('../Package/lol_0.5.tar.gz', repos=NULL, type='source')

    > install.packages('../Package/HTSanalyzeR_1.2.5.tar.gz', repos=NULL, type='source')

    > install.packages('../Package/IGIR_0.1.zip', repos=NULL, type='source')

    > install.packages('../Package/Dance_0.9.tar.gz', repos=NULL, type='source')

    ∗Cambridge Research Institute - CRUK, Li Ka Shing Centre, Robinson Way Cambridge, CB2 0RE, UK.


  • > library(Dance)

    > library(lol)

    > dataDir resDir ER er names(er) ge.data ge.h idx ge.data ge.h cn.data cn.h idx cn.data cn.h rownames(cn.data) commonCol commonCol cn.data ge.data er ge.data ge.h cn.data cn.h

  • Now run the EM algorithm calling discrete copy number:

    # minls threshold nSig

  • First set the color definition (default colors are not pretty!).

    > load('./dataDir/pam50colors.rdata')

    > load('./dataDir/iclustercolors.rdata')

    > pam50clusters pam50Subtype.color color.code color.code data fileName breaks pdf(paste('./resDir/', fileName,'.pdf', sep=''))

    > heatmap(as.matrix(data),

    + distfun=function(x) dist(x,method='manhattan'),

    + hclustfun=function(x) hclust(x,method='ward'),

    + col=color.code,breaks=breaks,

    + labRow=GE.cyto,scale='none',ColSideColors=pam50Subtype.color)

    > dev.off()


    Figure 1: Result of the hierarchical cluster analysis (GE data).
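For readers who want to try the heatmap call without the course data, here is a small self-contained sketch on simulated values. The toy matrix, its -1/0/1 coding, the colours and the two-group side bar are all assumptions made for illustration; only the heatmap() arguments mirror the call used in this tutorial. In R 3.1.0 and later, the linkage that older versions called 'ward' is named 'ward.D'.

```r
set.seed(1)
# Toy discrete matrix: 20 features x 12 samples with values -1, 0, 1 (assumed coding).
toy <- matrix(sample(-1:1, 20 * 12, replace = TRUE), nrow = 20,
              dimnames = list(paste0("feature", 1:20), paste0("S", 1:12)))

# One colour per interval: 'breaks' needs exactly one more element than 'col'.
toy.colors <- c("blue", "white", "red")
toy.breaks <- c(-1.5, -0.5, 0.5, 1.5)
stopifnot(length(toy.breaks) == length(toy.colors) + 1)

# Hypothetical sample grouping, used only to colour the column side bar.
toy.groups <- rep(c("darkgreen", "orange"), each = 6)

out.file <- file.path(tempdir(), "toyHeatmap.pdf")
pdf(out.file)
heatmap(toy,
        distfun = function(x) dist(x, method = "manhattan"),
        hclustfun = function(x) hclust(x, method = "ward.D"),
        col = toy.colors, breaks = toy.breaks,
        scale = "none", ColSideColors = toy.groups)
dev.off()
```

Setting scale='none' matters for discrete calls: row scaling would turn the three levels into arbitrary real values and the fixed breaks would no longer line up with them.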


  • Another feature set (CN data).

    > data fileName breaks pdf(paste('./resDir/', fileName,'.pdf', sep=''))

    > heatmap(as.matrix(data),

    + distfun=function(x) dist(x,method='manhattan'),

    + hclustfun=function(x) hclust(x,method='ward'),

    + col=color.code,breaks=breaks,

    + labRow=CN.cyto,scale='none',ColSideColors=pam50Subtype.color)

    > dev.off()

    Figure 2: Result of the hierarchical cluster analysis (CN data).

    3.2 Dirichlet process mixture model clustering

    We first load the package BHC. This library performs Bayesian hierarchical clustering on discretised CN and GE data. It performs bottom-up hierarchical clustering, using a Dirichlet process to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge.
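Since the clustering operates on discretised values, it may help to see what such a discretisation can look like. The sketch below uses base R only; the function name discretise and the cut-offs of -0.5 and 0.5 are assumptions chosen for illustration and are not the thresholds used in the course.

```r
# Hypothetical cut-offs: below -0.5 is a loss/down call, 0.5 or above a gain/up call.
discretise <- function(x, lower = -0.5, upper = 0.5) {
  # findInterval() returns 0, 1 or 2 depending on the interval each value falls in.
  out <- findInterval(x, c(lower, upper))
  dim(out) <- dim(x)            # restore the matrix shape
  dimnames(out) <- dimnames(x)  # and the feature/sample labels
  out
}

toy <- matrix(c(-1.2,  0.1, 0.9,
                 0.6, -0.7, 0.0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("f1", "f2"), c("S1", "S2", "S3")))
toy.discrete <- discretise(toy)
print(toy.discrete)
```

The result has three levels (0, 1, 2), which is the kind of small categorical alphabet a multinomial model can work with.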

    Load BHC package.

    > library(BHC)

    > samples bhc.cn bhc.ge

  • Plot results.

    > pdf(file='./resDir/BHCplotCN.pdf', width=30, height=8)

    > plot(bhc.cn)

    > dev.off()

    > WriteOutClusterLabels(bhc.cn, "./resDir/BHClabelsCN.txt", verbose=TRUE)

    > pdf(file='./resDir/BHCplotGE.pdf', width=30, height=8)

    > plot(bhc.ge)

    > dev.off()

    > WriteOutClusterLabels(bhc.ge, "./resDir/BHClabelsGE.txt", verbose=TRUE)

    (a) bhc.cn

    (b) bhc.ge

    Figure 3: Result of bhc() function.

    3.3 Integrative clustering

    We first load the package iCluster and then organize the data for the analysis [2]. The iCluster library implements a sparse clustering method which selects features from the CN and GE data for joint clustering of the samples. The idea is to generate integrated cluster assignments based on joint inference across multiple data types.
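Joint clustering only makes sense if every data type describes the same samples in the same order. The following base R sketch shows one way to line up two feature-by-sample matrices and collect them in a list. The toy matrices and object names are hypothetical, and the samples-in-rows layout is an assumption that should be checked against the iCluster documentation before use.

```r
set.seed(2)
samples <- paste0("S", 1:6)
# Toy feature-by-sample matrices standing in for the CN and GE data.
toy.cn <- matrix(rnorm(4 * 6), nrow = 4,
                 dimnames = list(paste0("cn", 1:4), samples))
toy.ge <- matrix(rnorm(5 * 6), nrow = 5,
                 dimnames = list(paste0("ge", 1:5), rev(samples)))  # different order

# Keep only samples present in both data types, in one common order.
common <- intersect(colnames(toy.cn), colnames(toy.ge))

# Assumed layout: one matrix per data type, samples in rows, features in columns.
toy.datasets <- list(cn = t(toy.cn[, common]), ge = t(toy.ge[, common]))
stopifnot(identical(rownames(toy.datasets$cn), rownames(toy.datasets$ge)))
str(toy.datasets)
```

Indexing both matrices by the same vector of sample names, rather than relying on column position, guards against silently pairing the wrong samples.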

    Load iCluster library and create an input object for iCluster.

    > library(iCluster)

    > datasets datasets[[1]] datasets[[2]] fit iclusters

  • > selected.cn.feature selected.ge.feature selected.feature selected.feature names(selected.feature) write.table(selected.feature, file='./resDir/selectedFeatureByIntClust.txt',

    + sep='\t', quote=FALSE, row.names=FALSE, col.names=FALSE)

    Plot the integrative clustering result using the plotCE() function from the Dance package.

    > dataCol dataColors CN.data GE.data rownames(CN.data) rownames(GE.data) plotCE(CN.data[selected.cn.feature, ], GE.data[selected.ge.feature, ], dataCol=dataCol,

    + grouping=iclusters, dataColors=dataColors, fileName='./resDir/IntClustOutcome.pdf')


    Figure 4: Result of plotCE() function.

    How do the outcomes of the different clustering algorithms differ?

    > source("./dataDir/myfunctions.r")

    > bhc.cn.clusters bhc.ge.clusters clustering1 clustering2 clustering3

  • > data rownames(data) colnames(data) pdf(file='./resDir/coOccuranceClustering.pdf')

    > heatmap(as.matrix(data),

    + distfun=function(x) dist(x, method='manhattan'),

    + hclustfun=function(x) hclust(x, method='ward'),

    + col=c('white','darkblue','darkred','black'),breaks=c(-1, 0.5, 1.5, 2.5, 3.5),

    + scale='none', ColSideColors=pam50Subtype.color)

    > dev.off()


    Figure 5: Result of co-occurrence clustering. Blue is co-occurrence once, red is twice and black is three times.
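A co-occurrence matrix of this kind can be computed in a few lines of base R. The sketch below is an illustration, not the code in myfunctions.r: the function name co_occurrence and the three toy clusterings are hypothetical. Each entry counts how many clusterings put a pair of samples in the same cluster, so with three clusterings the values range from 0 to 3.

```r
# Count, for every pair of samples, how many clusterings place them together.
co_occurrence <- function(clusterings) {
  n <- length(clusterings[[1]])
  counts <- matrix(0, n, n)
  for (cl in clusterings) {
    stopifnot(length(cl) == n)
    counts <- counts + outer(cl, cl, "==")
  }
  diag(counts) <- 0  # a sample paired with itself is uninformative (a choice made here)
  dimnames(counts) <- list(names(clusterings[[1]]), names(clusterings[[1]]))
  counts
}

# Three hypothetical clusterings of five samples.
ids <- paste0("S", 1:5)
c1 <- setNames(c(1, 1, 2, 2, 3), ids)
c2 <- setNames(c(1, 1, 1, 2, 2), ids)
c3 <- setNames(c(2, 2, 1, 1, 1), ids)
co <- co_occurrence(list(c1, c2, c3))
print(co)
```

Cluster labels are compared only within each clustering, so the result does not depend on how the individual methods number their clusters.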

    Discussions

    Only CN data from chromosomes 1 and 8 are chosen in the iCluster output. Why? And what happens when the penalty (lambda) is set lower?

    Setting the parameters for making calls from segmented CN data can be tricky in CGHcall. The minimum segment to be fitted has to be adjusted according to the probes designed on the array.

    The co-occurrence matrix enables comparisons across many clustering outcomes, but how many clusters should there be in this dataset, and does the POD score tell the same story?


  • Homework: Is there an optimal number of clusters?

    Compute scores of iCluster using different numbers of clusters k.
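The general pattern of scanning over k can be tried on toy data first. The sketch below does not compute the POD score and does not call iCluster: as a stand-in it uses Ward hierarchical clustering from base R and the total within-cluster sum of squares, and all object names are hypothetical.

```r
set.seed(3)
# Toy data: three well-separated groups of 10 samples each, 2 features.
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(10, centres[g, 1], 0.5), rnorm(10, centres[g, 2], 0.5))
}))

# Total within-cluster sum of squares for a given cluster assignment.
within_ss <- function(x, labels) {
  sum(sapply(split(as.data.frame(x), labels), function(block) {
    sum(scale(as.matrix(block), scale = FALSE)^2)
  }))
}

tree <- hclust(dist(toy), method = "ward.D2")
ks <- 2:6
scores <- sapply(ks, function(k) within_ss(toy, cutree(tree, k = k)))
names(scores) <- paste0("k=", ks)
print(round(scores, 2))
```

Because the partitions come from cutting one tree, this stand-in score can only decrease as k grows, so one looks for the k after which the improvement levels off rather than for a minimum.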

    # score.pod

  • [Figure: samples labelled by subtype (LumA, LumB, Basal, Her2, Normal) for K=4 and K=5]

    mal

    Nor

    mal

    Nor

    mal

    NormalNormalNormalLumB

    LumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumAHer2Her2

    BasalBasalNormalNormalNormalNormalNormalLumB

    LumBLumBLumBLumBHer2Her2Her2Her2

    BasalBasalBasalBasalBasalBasalBasalBasalBasalNormal

    NormalNormalLumBLumBLumBLumALumALumALumALumALumALumALumALumALumALumAHer2Basal

    BasalLumBLumBLumBLumBHer2Her2Basal

    BasalBasalBasalBasalBasalBasalBasalBasalBasalBasalLumBLumBLumBLumBLumBLumBLumBLumAHer2Her2

    BasalLumBLumBLumBLumALumALumALumALumALumA

    K=6

    Her

    2Lu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mB

    Lum

    BLu

    mB

    Bas

    alH

    er2

    Her

    2Lu

    mA

    Lum

    BLu

    mB

    Lum

    BLu

    mB

    Lum

    BLu

    mB

    Lum

    BB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alH

    er2

    Her

    2Lu

    mB

    Lum

    BLu

    mB

    Lum

    BB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alB

    asal

    Bas

    alH

    er2

    Her

    2H

    er2

    Lum

    BLu

    mB

    Nor

    mal

    Nor

    mal

    Nor

    mal

    Nor

    mal

    Bas

    alB

    asal

    Her

    2Lu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    BLu

    mB

    Lum

    BLu

    mB

    Lum

    BLu

    mB

    Nor

    mal

    Nor

    mal

    Nor

    mal

    Nor

    mal

    Bas

    alB

    asal

    Her

    2H

    er2

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    ALu

    mA

    Lum

    BN

    orm

    alN

    orm

    alN

    orm

    al

    NormalNormalNormalLumB

    LumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumALumAHer2Her2

    BasalBasalNormalNormalNormalNormal

    LumBLumBLumBLumBLumBLumBLumALumALumALumALumALumALumALumALumALumALumAHer2BasalBasal

    NormalNormalNormalNormal

    LumBLumBHer2Her2Her2Basal

    BasalBasalBasalBasalBasalBasalBasalBasalBasalLumBLumBLumBLumBHer2Her2Basal

    BasalBasalBasalBasalBasalBasalBasalBasalBasalLumBLumBLumBLumBLumBLumBLumBLumAHer2Her2Basal

    LumBLumBLumBLumALumALumALumALumALumAHer2

    K=7

    (a) k clusters

    ●●

    ● ●

    4.0 4.5 5.0 5.5 6.0 6.5 7.0

    0.19

    0.24

    4:7

    scor

    e.po

    d

    (b) iCluster scores

    Figure 6: Result of the plotiCluster() function.


  • R version 2.12.0 (2010-10-15)

    Platform: i386-pc-mingw32/i386 (32-bit)

    attached base packages:

    [1] splines grid stats graphics grDevices utils datasets

    [8] methods base

    other attached packages:

    [1] iCluster_1.2.0 corpcor_1.5.7 penalized_0.9-32

    [4] survival_2.35-8 DLBCL_1.3.1 snow_0.3-3

    [7] Dance_0.9 IGIR_0.1 igraph_0.5.4-2

    [10] KEGG.db_2.4.5 RSQLite_0.9-2 DBI_0.2-5

    [13] Matrix_0.999375-44 lol_0.5 CGHregions_1.8.0

    [16] CGHbase_1.8.0 marray_1.27.0 limma_3.5.21

    [19] qvalue_1.24.0 samr_1.28 impute_1.22.0

    [22] biomaRt_2.5.1 HTSanalyzeR_1.2.5 RankProd_2.22.0

    [25] cellHTS2_2.14.0 locfit_1.5-6 lattice_0.19-13

    [28] akima_0.5-4 hwriter_1.2 vsn_3.17.2

    [31] splots_1.16.0 genefilter_1.31.2 RColorBrewer_1.0-2

    [34] BioNet_1.8.0 RBGL_1.25.1 GSEABase_1.12.0

    [37] graph_1.28.0 annotate_1.27.3 AnnotationDbi_1.11.10

    [40] Biobase_2.9.2 R.utils_1.5.3 R.oo_1.7.4

    [43] R.methodsS3_1.2.1

    loaded via a namespace (and not attached):

    [1] affy_1.28.0 affyio_1.18.0 Category_2.16.0

    [4] MASS_7.3-7 prada_1.26.0 preprocessCore_1.12.0

    [7] RCurl_1.4-4.1 rrcov_1.1-00 stats4_2.12.0

    [10] tcltk_2.12.0 tools_2.12.0 XML_3.2-0.1

    [13] xtable_1.5-6
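    To compare your own installation with the listing above, something like the following sketch may help. It assumes only base R; the helper name check_versions is hypothetical (it is not part of any course package), and the expected versions are a small subset copied by hand from the listing, so extend the vector as needed.

```r
# Versions copied from the listing above (a subset, for illustration only).
expected <- c(lol = "0.5", Dance = "0.9", IGIR = "0.1",
              HTSanalyzeR = "1.2.5", iCluster = "1.2.0")

# Hypothetical helper: report, for each package, whether the installed
# version matches the expected one, differs, or is missing.
check_versions <- function(expected) {
  installed <- rownames(installed.packages())
  status <- vapply(names(expected), function(pkg) {
    if (!(pkg %in% installed)) return("not installed")
    have <- as.character(packageVersion(pkg))
    if (have == expected[[pkg]]) "matches" else paste("differs, found", have)
  }, character(1))
  data.frame(package = names(expected), expected = unname(expected),
             status = unname(status), stringsAsFactors = FALSE)
}

print(check_versions(expected))

# Full version information for the current session, in the same spirit
# as the listing above.
sessionInfo()
```

    A status of "differs" does not necessarily mean the tutorial will fail; it only flags where your setup departs from the one used to build this document.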
