Analyzing the NCI-60 Cancer Cell Lines using Data Obtained from Genome-Wide ChIP-X Experiments

Jayanth (Jay) Krishnan

Mahopac High School

January 25, 2010

Jayanth Krishnan, 32 Country Knolls Ln, Mahopac, NY 10541
Mahopac High School, Mahopac, NY
Teacher and/or Mentor: Mr. Mark Langella/Dr. Avi Ma’ayan

Applying data obtained from high-throughput ChIP-on-chip and ChIP-Seq experiments to the NCI-60 Cancer Cell Lines can allow for the identification of the regulatory program responsible for the aberrant expression of genes in the different types of cancer. Using input data from a published resource specific to the NCI-60 Cancer Cell Lines, a system can be developed to identify mis-regulated transcription factors that potentially regulate transcription factor activity. By building cancer-specific networks from the regulatory interaction data, these transcription factors can be organized to determine which transcription factor activity influences cell activity in each of the NCI-60 cell lines. With networks and lists of the specific transcription factors involved in regulating each cancer cell line, experimental biologists can design functional experiments targeting the transcription factors identified by this project. We propose that integrating ChIP-X data with the NCI-60 mRNA expression data is useful for obtaining a comprehensive understanding of the molecular mechanisms responsible for aberrant expression in different cancers.

ACKNOWLEDGEMENTS

First of all, I would like to thank my mentor, Dr. Avi Ma’ayan, and the Department of Pharmacology and Systems Therapeutics of Mount Sinai School of Medicine for giving me this research opportunity and for allowing me to use their premises and funding. I would also like to thank my dedicated teacher, Mr. Mark Langella, who guided me in the right direction and consistently encouraged and motivated me.

I would like to acknowledge the National Institutes of Health (Grant No. P50GM071558, Seed fund, Mount Sinai School of Medicine (to A.M.)) for funding the research project.

My sincere thanks also go to my school district and to my Principal, Mr. Bilyeu, who supported the science research program at our school.

I would also like to thank Dr. Darzynkiewicz, my first mentor, who gave me research lab experience in apoptosis and encouraged me to get involved with science research.

Last but not least, I would like to thank my parents, Mr. Krishnan Sugavanam and Mrs. Mythily Krishnan, for the motivation and moral support that they gave me.

TABLE OF CONTENTS

Statement of Purpose
List of Tables
List of Figures
Introduction
Materials and Methods
Results and Analysis
References

Statement of Purpose:

The purpose of this project is to create a system that will identify mis-regulated transcription factors that may regulate transcription factor activity in cancerous cells. This can be done using ChIP-on-chip, ChIP-Seq, and ChIP-PET data from published papers together with the mRNA expression data for the NCI-60 Cancer Cell Lines available through CellMiner.

The system and the data obtained will allow for the study and identification of the regulatory program responsible for the aberrant expression of genes in the different types of cancer.

The study will further allow the development of cancer-specific networks from the regulatory interaction data. These networks can be used to determine which transcription factor activity influences cell activity in each of the cancer cell lines.

LIST OF TABLES

Table 1: Over expressed genes for the cancer cell line ME:MDA_N

Table 2: Transcription factors influencing over expressed genes of the cancer cell line ME:MDA_N

Table 3: Signaling pathways of the intermediates linked to the genes regulated by transcription factor GABP

LIST OF FIGURES

Figure 1: Genes2Networks diagram of GABP genes and significant intermediates

Introduction

ChIP-on-chip is a technique used to investigate protein-DNA interactions. It combines chromatin immunoprecipitation (ChIP) with microarray technology (chip) in order to identify the binding sites of a given protein. Once these binding sites are found, scientists can learn about the functions of various transcriptional regulators and may identify functional elements such as promoters, enhancers, and sequences that control DNA replication. ChIP-on-chip was first used successfully to map binding sites in the model organism budding yeast, Saccharomyces cerevisiae (1. Horak and Snyder, 2002).

ChIP-Seq (5. Nature Methods, 2009), very much like ChIP-on-chip, combines chromatin immunoprecipitation (ChIP) with DNA sequencing to identify targets of DNA-associated proteins. Compared with ChIP-on-chip, ChIP-Seq is more sensitive, cheaper, and quicker.

The NCI-60 Cancer Cell Lines (6. DTP 2009) are 60 cancer cell lines from various human tissues including the bone marrow, breast, colon, kidney, lung, ovary, prostate and skin. These cells have been profiled with regard to gene expression, chromosome karyotype, and proteomic status, and they are also used in pharmacological research. For this project we combined data from ChIP-on-chip and ChIP-Seq experiments and used them to analyze the mRNA expression profiles of the NCI-60 cell lines.

Materials and Methods

In Phase 1, ChIP-on-chip, ChIP-Seq and ChIP-PET data will be gathered from previously published experiments by extracting them from the supplemental Excel spreadsheets and PDF tables of those publications. Phase 1 is essential for the creation of a database, and a system, that will be used to identify the transcription factors specific to a gene list identified from the NCI-60 Cancer Cell Lines.

The Phase 1 database of mammalian ChIP data will consist of various transcription factors and the genes whose promoter regions they bind. The database and system created will be entitled ChIP Enrichment Analysis (ChEA) (7. Lachmann et al. 2009) and will contain over 100,000 interactions extracted from over 60 publications. The analysis will include over 80 transcription factors and the thousands of target genes which they potentially regulate. The accumulated data will be processed so that it can later be made into a user-friendly system. In particular, all gene identifiers will be converted into official gene symbols, which includes translating RIK identifiers, Swiss-Prot identifiers and many other synonyms into their official gene names. To accomplish this, the genes gathered in the database will be cross-checked against all known human, mouse and rat genes using a Perl-based program.
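
The identifier clean-up described above can be illustrated with a minimal Python sketch. It is not the Perl program used in this project: the function name `normalize_identifiers`, the small synonym table and the set of official symbols are assumptions made for illustration, and a real table would be built from human, mouse and rat gene annotation files.

```python
# Hypothetical lookup tables for illustration only; real ones would come from
# gene annotation files for human, mouse and rat.
OFFICIAL_SYMBOLS = {"TP53", "ERBB2"}
SYNONYM_TO_SYMBOL = {
    "P53": "TP53",      # common alias
    "P04637": "TP53",   # Swiss-Prot accession of human p53
    "HER2": "ERBB2",    # common alias
    "NEU": "ERBB2",     # common alias
}


def normalize_identifiers(identifiers, official_symbols, synonym_to_symbol):
    """Map raw identifiers to official symbols; return (resolved, unresolved)."""
    official_upper = {s.upper(): s for s in official_symbols}
    synonyms_upper = {k.upper(): v for k, v in synonym_to_symbol.items()}
    resolved, unresolved = [], []
    for raw in identifiers:
        key = raw.strip().upper()
        if key in official_upper:
            symbol = official_upper[key]
        elif key in synonyms_upper:
            symbol = synonyms_upper[key]
        else:
            unresolved.append(raw)   # keep for manual inspection
            continue
        if symbol not in resolved:   # drop duplicates, keep first-seen order
            resolved.append(symbol)
    return resolved, unresolved


resolved, unresolved = normalize_identifiers(
    ["p53", "HER2", "TP53", "XYZ123"], OFFICIAL_SYMBOLS, SYNONYM_TO_SYMBOL
)
print(resolved, unresolved)
```

Returning the unresolved identifiers separately, rather than discarding them, makes it possible to check by hand which entries from a publication could not be matched.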

In Phase 2, once the database has been created and presented in a specific format, a computer program can be written that allows one to enter a list of genes and retrieve the corresponding ranked list of transcription factors. For this project, lists of differentially expressed genes from the various NCI-60 Cancer Cell Lines are used as the gene input. These genes will be identified by applying statistical techniques to the NCI-60 database. When the system, entitled ChEA, has been created, it will be capable of computing the over-representation of transcription factor targets from the ChIP database within the input list. Simply stated, ChEA generates a ranked list of the transcription factors whose target genes overlap with the input gene list.
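
The over-representation idea can be sketched in Python as follows. The sketch assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher exact test); the text above does not state which statistic ChEA uses, so this choice is an assumption. The transcription factor names, gene names and background size are hypothetical, and every input gene is assumed to belong to the background.

```python
from math import comb


def hypergeom_pvalue(overlap, n_targets, n_input, n_background):
    """P(X >= overlap) when n_input genes are drawn from n_background genes,
    of which n_targets are targets of the transcription factor."""
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_input - k)
        for k in range(overlap, min(n_targets, n_input) + 1)
    )
    return tail / total


def rank_transcription_factors(input_genes, tf_to_targets, n_background):
    """Return (tf, overlap, p_value) tuples sorted by ascending p-value."""
    input_genes = set(input_genes)
    rows = []
    for tf, targets in tf_to_targets.items():
        overlap = len(input_genes & set(targets))
        p = hypergeom_pvalue(overlap, len(targets), len(input_genes), n_background)
        rows.append((tf, overlap, p))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical toy data: 20 background genes, 4 input genes, 2 factors.
toy_targets = {
    "TF_A": {"g1", "g2", "g3", "g9"},
    "TF_B": {"g4", "g10", "g11", "g12", "g13"},
}
ranked = rank_transcription_factors({"g1", "g2", "g3", "g4"}, toy_targets, 20)
for tf, overlap, p in ranked:
    print(tf, overlap, round(p, 4))
```

A factor that shares many of its targets with a short input list receives a small p-value and therefore appears near the top of the ranked list.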

Lastly, in Phase 3 of the project, these genes can be connected together in a network based on protein-protein interactions. In addition, the gene lists from the NCI-60 Cancer Cell Lines can be entered into a program such as Kinase Enrichment Analysis (KEA) (8. Lachmann and Ma’ayan 2009), which can generate a list of specific kinases for each of the NCI-60 cancer cell lines. KEA accomplishes this task by computing how the proportion of kinases associated with a specific list of proteins/genes deviates from an expected distribution. Kinases and kinase families can then be ranked, based on the assumption that these kinases are functionally associated with regulating the cell under specific experimental conditions.

Networks connecting mis-regulated transcription factors will be created using the yEd software (9. yWorks).

Results

The NCI-60 Cancer Cell Line data from CellMiner (10. CellMiner 2009) are available in many different forms. For this project, we extracted the normalized Affymetrix HG-U133A mRNA data. Because every gene is measured in every cell line, these data give expression values for each (gene, cancer cell line) pair, measured with several experimental probes. The mean and standard deviation of the expression of each gene across all experiments were computed. Then, for each cancer cell line, a test statistic was computed and a two-sided t-test was used to determine whether the gene is statistically over expressed or under expressed. A program was written in Perl to process the data from the NCI-60 database. This program performed the following steps:

1. Parsed the file and found 18133 unique genes.
2. Found the population mean and sigma for the expression of each gene across all 60 cancer cell lines.
3. Found the sample mean and estimated sigma for each (gene, cancer cell line) pair.
4. Determined a test statistic for each (gene, cancer cell line) pair.
5. Determined whether the gene was over expressed or under expressed by checking whether the test statistic exceeded an upper critical t score or was less than a lower critical t score, both determined from a chosen p value.
6. Found the list of genes which are over or under expressed in multiple cancer cell lines.
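
Steps 2 to 5 can be sketched in Python as shown below. This is not the Perl program itself. It assumes a one-sample t statistic that compares the probe values of one cell line against the mean over all cell lines and uses the sample sigma of the probes; the exact statistic is not spelled out above, and the population sigma from step 2 is not used in this sketch. The cell line names and expression values are hypothetical. The critical values are the standard two-sided t values for 0.025 in each tail.

```python
from math import sqrt
from statistics import mean, stdev

# Critical t values for 0.025 in each tail, keyed by degrees of freedom.
CRITICAL_T = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def classify_gene(probe_values_by_line):
    """Label one gene as over, under or unchanged in each cell line.

    probe_values_by_line maps a cell line name to the list of probe
    expression values measured for this gene in that cell line.
    """
    all_values = [v for values in probe_values_by_line.values() for v in values]
    population_mean = mean(all_values)          # step 2 (mean only)
    calls = {}
    for line, values in probe_values_by_line.items():
        n = len(values)
        if n < 2 or (n - 1) not in CRITICAL_T:
            calls[line] = "not tested"          # too few or too many probes for the table
            continue
        sample_sd = stdev(values)               # step 3
        if sample_sd == 0:
            calls[line] = "not tested"          # t statistic undefined
            continue
        t_stat = (mean(values) - population_mean) / (sample_sd / sqrt(n))  # step 4
        critical = CRITICAL_T[n - 1]
        if t_stat > critical:                   # step 5
            calls[line] = "over"
        elif t_stat < -critical:
            calls[line] = "under"
        else:
            calls[line] = "unchanged"
    return calls


# Hypothetical expression values for one gene in three cell lines.
toy_gene = {
    "LINE_A": [5.0, 5.2, 4.8],
    "LINE_B": [6.0, 7.0, 8.0],
    "LINE_C": [9.0, 9.2, 8.8],
}
calls = classify_gene(toy_gene)
print(calls)
```

Running the function once per gene and collecting the genes labelled "over" or "under" for each cell line gives the per-line gene lists used in the rest of the analysis.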

An alpha value of 0.025 in each tail was chosen for the two-sided t-test. Using this alpha value, the program reported around 200 over or under expressed genes for each cancer cell line. Using the above program in conjunction with ChEA [7. Lachmann et al], two of the results that were obtained are illustrated below.

1. Determination of the most influential transcription factors of a particular cancer cell line

The melanoma cluster includes two cancer cell lines, MDA-MB-435 and MDA_N, which were originally thought to be breast cancers. MDA-MB-435 is absolutely characteristic of melanotic melanoma in all molecular profiling studies [11. Shankavaram et al]. MDA_N is an ErbB2 transfectant of MDA-MB-435. The following 222 over expressed genes were reported for the cancer cell line ME:MDA_N.

PLD1 SART1 WARS DNAJB1 M6PR GPNMB SREBF2 SEC11A
RAE1 GPAA1 SLC29A1 IL13RA1 TRAK2 AKR7A2 USF2 CNOT8
CHPF GAS7 CUL4B SLC6A8 CSNK1E GTF2H1 PCOLCE CHMP2B
PRKCD DHPS SLC37A4 TCTA TIMP2 STX7 SNTA1 PTPN18
CTSF CD302 KAT2B EDNRB NOV HPS5 SMCR7L ACP5
GJB1 AP1S1 ATP7A EIF2S3 DCT DYNC1I1 BCHE GSTT2
NPAS2 SCRG1 TRAIP UBFD1 TYRP1 TRPM2 DDX18 SLC4A3
HBE1 RXRG PLXNB3 KCNS3 S100A3 ASPA SLC22A18AS PPP2R4
TYR GLRX HCG4 GPR143 DGKI CGGBP1 CSRP2 SLC25A11
STAU1 MAGEA1 CTAG2 ZNF200 CAST PDIA6 BEST1 MCM7
HLA-DRA TXN2 TNFRSF14 FAM3A RAB27A SLC1A4 KHDRBS3 ART3
PLP1 HSP90AA1 MAGEA12 CTAG1B NRP2 CAPN3 GYG2 DLAT
SPIN2B C1orf144 SLC25A6 ZCCHC24 SUCLG2 TUBB4 RFNG MORC3
C14orf109 CLCN2 C9orf61 CCPG1 MAGEA2B MAGEA5 AZI1 UAP1L1
DPY19L2P2 DUSP10 SLC6A10P SLC5A4 GK3P SFXN3 HLA-DMA ALDH18A1
BACE2 MUL1 METT11D1 MRPS18A KLF11 ARHGEF3 FAHD2A RINT1
CRIPT MTO1 HEY1 FBXL15 CSGALNACT1 ITIH5 CA14 C14orf139
CADPS2 PRR7 GAL3ST4 NUDT11 CEP97 LONRF3 TINF2 TP53TG3
MGAT4B FAM86C ROPN1B C20orf30 TRIM48 TH1L C5orf54 CDCA3
RPL23AP7 MICALL1 LDLRAP1 C17orf90 LUZP1 C3orf64 GPR177 COL9A1
LOC348926 PCOTH FAM86B1 LPCAT2 SURF4 XPO5 PDXP COPG2
HAGHL TNFSF13B FAM167B SPRYD5 DGAT2 C2orf30 C6orf89 UBL7
ULK3 TOMM40L FAM160B1 SNX30 TMEM55A GGT7 C12orf34 C3orf38
HDAC10 LOC400657 AFAP1L1 FAM125A OLIG1 C11orf82 ENHO CITED4
HSD11B1L SCARNA15 SMYD4 LOC153364 CAMK2N2 PAGE2 LOC730124 GBGT1
CHRM1 AARS2 ANKRD54 KIAA1524 KIAA1586 ZC3H12C FSTL5 CLEC2L
GPR158 SLITRK4 GNASAS ELOVL3 ST6GALNAC3 LRRC33 NPHP3 HMCN1
RNF175 C5orf35 LOC147645 LOC730259 TMEM171 DLX1

Table 1: Over expressed genes for the cancer cell line ME:MDA_N

Next, we used the ChEA database [7. Lachmann et al] to find the transcription factors influencing genes among the 222 genes above, sorted by ascending p value.

Table 2: Transcription factors influencing over expressed genes of ME:MDA_N

The transcription factors at the top of the above list (with low p values) are the ones whose targets are most strongly over-represented among the over expressed genes, and are therefore the ones most likely to influence the cancer cell line ME:MDA_N.

The highest ranked transcription factor in the above table is GABP, which is linked to 43 of the 222 over expressed genes.

During the cell cycle, the four-phase process of cell division, there is a period when the biochemical brakes are put on and cells become inactive. Then the process is kick-started and cells move into the so-called S phase, when DNA is duplicated. This is a critical juncture. If genes are missing or broken, these alterations are passed on to the new cell and could result in disability or in diseases such as cancer. A team from Brown University has recently confirmed GABP's critical role in cell growth. They found that simply forcing dormant cells to make GABP was enough to rouse the cells from their slumber and get them to grow again [12. Rosmarin et al]. GABP is now recognized to be a key transcriptional regulator of dynamically regulated, lineage-restricted genes, especially in myeloid cells and at the neuromuscular junction. However, the role of GABP in the ME:MDA_N cell line was not previously considered. Hence, our analysis proposes to focus on this factor to explain the changes in expression observed in this cancer.

2. Construction of a gene interaction network of the 43 over expressed genes regulated by GABP

We used Genes2Networks [13. Berger et al] to study the interactions of the 43 genes identified as potentially regulated by GABP. Figure 1 below shows the output of the Genes2Networks application.

Figure 1: Genes2Networks diagram of GABP target genes and significant intermediates

The application identified 13 interactions among 5 of the input genes (XPO5, RAE1, HSP90AA1, PPP2R4, MCM7). It also identified 8 intermediate nodes (CDK7, NUP98, CSNK2A1, NUP214, RBL2, FHL2, MYOD1, AR). Intermediate nodes are genes outside the input seed list that connect seed genes to one another, and some of them may interact specifically with components of the seed list. If these intermediates are specific, it may be beneficial for the user to treat them as potential regulators of, and participants in, the pathways, protein complexes and modules that involve the input seed list. Interestingly, three intermediate genes recovered by Genes2Networks are components of the Androgen Receptor (AR) pathway, including AR itself. The nuclear receptor AR is mostly known to be mis-regulated in prostate cancer. However, here it appears to also be involved in melanoma.
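
The idea of recovering intermediates that connect seed genes can be sketched in Python as shown below. This is not the Genes2Networks algorithm or its interaction database. The sketch takes the links listed in Table 3 of this paper, treats each (intermediate, linked gene) pair as an undirected interaction (an assumption), and searches for paths between two seed genes that pass through at most two non-seed nodes. The function names and the `max_intermediates` parameter are hypothetical.

```python
from collections import defaultdict

# Rows transcribed from Table 3: intermediate -> linked hubs and input genes.
TABLE3_LINKS = {
    "CDK7": ["AR", "MCM7"],
    "FHL2": ["AR", "MCM7"],
    "NUP214": ["NUP98", "XPO5"],
    "RBL2": ["MYOD1", "MCM7"],
    "AR": ["CDK7", "FHL2", "HSP90AA1"],
    "CSNK2A1": ["PPP2R4", "HSP90AA1"],
    "NUP98": ["RAE1", "NUP214"],
    "MYOD1": ["RBL2", "HSP90AA1"],
}
SEED_GENES = {"XPO5", "RAE1", "HSP90AA1", "PPP2R4", "MCM7"}


def build_graph(links):
    """Build an undirected adjacency map from the link table."""
    graph = defaultdict(set)
    for node, partners in links.items():
        for partner in partners:
            graph[node].add(partner)
            graph[partner].add(node)
    return graph


def connecting_intermediates(graph, seeds, max_intermediates=2):
    """Collect non-seed nodes lying on a path between two seed genes
    that uses at most max_intermediates non-seed nodes."""
    found = set()

    def walk(path):
        for neighbor in graph[path[-1]]:
            if neighbor in path:
                continue                      # no revisiting within a path
            if neighbor in seeds:
                found.update(path[1:])        # interior nodes are intermediates
            elif len(path) - 1 < max_intermediates:
                walk(path + [neighbor])

    for seed in seeds:
        walk([seed])
    return found


graph = build_graph(TABLE3_LINKS)
edges = {frozenset((a, b)) for a in graph for b in graph[a]}
intermediates = connecting_intermediates(graph, SEED_GENES)
print(len(edges), sorted(intermediates))
```

Under these assumptions the Table 3 links reduce to 13 distinct undirected pairs, and allowing two intermediates per path is enough to reach all eight intermediate nodes named above.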

Table 3: Signaling pathways of the intermediates linked to the genes regulated by GABP

Conclusions and Future research

Depending on the type of cancer a scientist is studying, the genes of interest that are either over or under expressed can be identified using statistical analysis. Using these genes as input, ChEA returns the transcription factors most likely to be responsible for the observed expression changes, ranked according to p value. The top ranked transcription factors (with low p values) are the ones most likely to influence the cancer cell lines. Moreover, a protein interaction network can be reconstructed, and regulators and specific participants in pathways and protein complexes can be identified. Such networks can be used to study the regulation of gene expression by the transcriptional machinery, which is central to understanding mammalian cellular regulation. Here we identify novel candidate mechanisms for the dysregulation of melanoma cells by the transcription factor GABP, which may regulate aberrant activity of the AR pathway.

Future research involves constructing gene and protein interaction networks for all the principal cancer cell lines and identifying the transcription factors that regulate their expression. Additionally, we can combine such data with data collected on drug perturbation of these cells to suggest which drugs can reverse the observed changes.

Intermediates   Signaling Pathway                            Linked Hubs and genes from input
CDK7            Androgen Receptor                            AR, MCM7
FHL2            Androgen Receptor                            AR, MCM7
NUP214          TGF-beta Receptor                            NUP98, XPO5
RBL2            Id Signaling Pathway                         MYOD1, MCM7
AR (NR3C4)      Androgen Receptor                            CDK7, FHL2, HSP90AA1
CSNK2A1         Wnt                                          PPP2R4, HSP90AA1
NUP98           Biochemical pathway using Tyrosine Kinase    RAE1, NUP214
MYOD1           Id Signaling Pathway                         RBL2, HSP90AA1

References

1. Horak C, Snyder M. ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods Enzymol. 2002;350:469-484.

2. "Genome.gov | DNA Microarray Fact Sheet." Genome.gov | National Human Genome Research Institute. Web. 01 June 2010. <http://www.genome.gov/10000533>.

3. "GCAT DNA Chip Wet and Dry Lab Simulations." Biology @ Davidson. Web. 01 June 2010. <http://www.bio.davidson.edu/projects/gcat/HSChips/HSchips.html>.

4. MacArthur, Ben D., Avi Ma'ayan, and Ihor R. Lemischka. "Systems biology of Stem cell fate and cellular reprogramming." Nature. Print.

5. "Access : ChIP-seq: welcome to the new frontier : Nature Methods." Nature Publishing Group : science journals, jobs, and information. Web. 02 Jan. 2010. <http://www.nature.com/nmeth/journal/v4/n8/full/nmeth0807-613.html>.

6. "DTP - Cell Lines in the In Vitro Screen." Developmental Therapeutics Program NCI/NIH. Web. 10 Nov. 2009. <http://dtp.nci.nih.gov/docs/misc/common_files/cell_list.html>.

7. Lachmann, Alexander, Jayanth Krishnan, and Avi Ma'ayan. "ChEA: Transcription Factor Regulation Inferred from Integrating High-Content ChIP Experiments." 2009. MS. Mount Sinai School of Medicine, New York.

8. Lachmann, Alexander, and Avi Ma'ayan. "KEA: kinase enrichment analysis." Bioinformatics (2009). Print.

9. yWorks. yEd. Computer software. Vers. 3.4.1. Web. <http://www.yworks.com/en/products_yed_about.html>.

10. "CellMiner - Home." The Genomics and Bioinformatics Group. Web. 01 Jan. 2010. <http://discover.nci.nih.gov/cellminer/home.do>.

11. Shankavaram U, Reinhold WC, Nishizuka S, et al. Transcript and protein expression profiles of the NCI-60 cancer cell panel: an integromic microarray study. Mol Cancer Ther 2007;6:820–32.

12. Rosmarin AG, Resendes KK, Yang Z, McMillan JN, Fleming SL. GA-binding protein transcription factor: a review of GABP as an integrator of intracellular signaling and protein-protein interactions. PMID: 14757430

13. Berger SI, Posner JM, Ma'ayan A. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases.