

While gene expression data describing mRNA levels in different cancer cell lines is widely available, the molecular regulatory mechanisms responsible for these changes are still poorly understood. Here we developed a rational approach to infer the regulatory mechanisms governing changes in gene expression by integrating datasets of protein/DNA interactions, protein-protein interactions and kinase-substrate interactions collected from prior biological knowledge. We first use data obtained from genome-wide ChIP-on-chip and ChIP-Seq experiments to connect mRNA expression levels of the NCI-60 cancer cell lines to the transcription factors most likely regulating them. These identified transcription factors are then “connected”, using known protein-protein interactions, to form cancer-specific sub-networks. Within these sub-networks we assess the enrichment for protein kinase substrates to infer the protein kinases likely regulating these complexes. Finally, by quantitatively comparing the up- and down-regulated genes for each cancer cell line with the genes affected by FDA-approved drugs applied to cancer cells, we predict the mechanisms of action of these drugs. Following this path, from changes in gene expression to transcription factors to protein kinases, we can provide a more thorough understanding of the regulatory mechanisms behind the observed mRNA levels in the NCI-60 cancer cell lines and other cancer cells. This approach proposes mechanisms of action for drugs. Wet lab experimental validation of this approach is still necessary; it can be done using single drugs or combinations of them.

ChEA Genes2Networks KEA

This research was supported by NIH Grant No. 5P50GM071558

Regulatory Signatures of Cancer Cell Lines Inferred from Expression Data

Jayanth (Jay) Krishnan 1,2, Avi Ma’ayan 2

1 Mahopac High School, Mahopac, NY 10541

2 Systems Biology Center New York and Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York NY

Introduction

• The NCI-60 database provides mRNA profiles from microarray experiments on 60 commonly studied cancer cell lines

• Although analyzing these mRNA values is a reliable way to measure the mRNA levels of many genes within a cell, it offers few clues about how cells are regulated

• While mRNA profiles indicate changes caused by cancer, understanding the underlying regulatory mechanisms that are dysregulated in different cancers will bring us closer to therapeutics

• In this project we aim to identify the transcription factors, protein complexes and protein kinases responsible for the aberrant expression of genes in the various types of cancer cell lines

• Differentially expressed gene lists from the various NCI-60 cancer cell lines are used as input

• Overexpressed and underexpressed genes are identified for specific cancer cell lines

• The following algorithm was implemented:

• The NCI-60 database was parsed and 18,133 unique genes were identified

• The population mean expression of each gene across all 60 cancer cell lines was calculated

• The sample mean and standard deviation for each (gene, cancer cell line) pair were calculated

• A two-sided t-test statistic was computed for each (gene, cancer cell line) pair

• A gene was called overexpressed if its test statistic exceeded the positive critical t score, and underexpressed if it fell below the negative critical t score, where the critical score was determined by a chosen p-value

• A list of genes which are over/under expressed for multiple cancer cell lines was developed
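The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code used for the poster: it assumes each gene is measured by several probes per cell line, that the population mean pools all probes and cell lines for that gene (as in the poster's test statistic), and that the caller supplies the critical t score. The names `t_statistic`, `call_expression` and the toy cell line labels are hypothetical.

```python
import math
import statistics


def t_statistic(probe_values, population_mean):
    """One-sample t statistic: (x_bar - mu) * sqrt(n) / s."""
    n = len(probe_values)
    x_bar = statistics.mean(probe_values)
    s = statistics.stdev(probe_values)  # sample std dev (n - 1 denominator)
    # Note: s must be non-zero, i.e. the probes must not all be identical.
    return (x_bar - population_mean) * math.sqrt(n) / s


def call_expression(gene_matrix, critical_t):
    """Label one gene as over/under/unchanged in every cell line.

    gene_matrix maps a cell line name to the probe measurements of the gene.
    critical_t is the positive two-sided critical value chosen by the caller.
    """
    all_values = [v for values in gene_matrix.values() for v in values]
    mu = statistics.mean(all_values)  # pooled over all probes and cell lines
    calls = {}
    for cell_line, values in gene_matrix.items():
        t = t_statistic(values, mu)
        if t > critical_t:
            calls[cell_line] = "over"
        elif t < -critical_t:
            calls[cell_line] = "under"
        else:
            calls[cell_line] = "unchanged"
    return calls


# Toy data: three probes for one hypothetical gene in three cell lines.
toy_gene = {
    "LINE_A": [5.0, 5.2, 4.8],
    "LINE_B": [9.0, 9.3, 8.7],
    "LINE_C": [1.0, 1.2, 0.8],
}
# 4.303 is the two-sided 5% critical value of Student's t with 2 degrees of freedom.
calls = call_expression(toy_gene, critical_t=4.303)
print(calls)
```

Passing the critical value in as a parameter keeps the sketch free of external dependencies; in practice it would be looked up from the t distribution for n - 1 degrees of freedom and the chosen p-value.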

Analyzing the mRNA profile from the NCI-60 database

Statistical Methods:

ChEA, Genes2Networks, and KEA are all web-based tools developed at the Ma’ayan lab that allow users to predict which transcription factors, protein sub-networks, and protein kinases are most strongly associated with an input seed list

• When the up- and down-regulated genes identified for each cancer cell line are used as input to ChEA, the output is the list of top-ranked transcription factors (ranked by p-value from Fisher’s Exact Test) that most likely influence the input seed list
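To make the ranking step concrete, the following Python sketch computes a one-sided Fisher's Exact Test p-value (the hypergeometric upper tail) for the overlap between an input gene list and each transcription factor's target set, then sorts by p-value. It is an illustration only and does not reproduce ChEA's database or exact scoring; the function names, gene names, target sets and background size are all hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, list_size, target_size, background_size):
    """One-sided Fisher's exact test p-value (hypergeometric upper tail).

    Probability of drawing at least `overlap` targets when `list_size` genes
    are sampled from `background_size` genes, `target_size` of which are
    targets of the transcription factor.
    """
    total = comb(background_size, list_size)
    max_overlap = min(list_size, target_size)
    tail = sum(
        comb(target_size, k) * comb(background_size - target_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_transcription_factors(input_genes, tf_targets, background_size):
    """Return (tf, p_value) pairs sorted from most to least significant."""
    input_set = set(input_genes)
    results = []
    for tf, targets in tf_targets.items():
        overlap = len(input_set & set(targets))
        p = enrichment_p_value(overlap, len(input_set), len(targets), background_size)
        results.append((tf, p))
    return sorted(results, key=lambda pair: pair[1])


# Toy example: 5 input genes, 2 hypothetical factors, background of 20 genes.
input_genes = ["g1", "g2", "g3", "g4", "g5"]
tf_targets = {
    "TF_A": {"g1", "g2", "g3", "g4", "x1"},  # 4 of 5 targets are in the input
    "TF_B": {"g5", "x2", "x3", "x4", "x5"},  # 1 of 5 targets is in the input
}
ranked = rank_transcription_factors(input_genes, tf_targets, background_size=20)
for tf, p in ranked:
    print(f"{tf}\t{p:.4g}")
```

The same enrichment calculation applies whenever a gene or protein list is tested against annotated sets, so a kinase-substrate library could be substituted for the target sets without changing the code.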

Future Research

• Future research involves further analysis of other cancer datasets

• Cluster analysis will be done to group the transcription factors or kinases that were identified

• Additionally, by combining such data with data collected for drug perturbation of these cells, we may be able to suggest which drugs can reverse the observed changes

Expression matrix M (rows: probes 1 to n; columns: cancer cell lines 1 to 60)

Probes   1       2       3       …   c       …   60
1        M1,1    M1,2    M1,3    …   M1,c    …   M1,60
2        M2,1    M2,2    M2,3    …   M2,c    …   M2,60
…        …       …       …       …   …       …   …
n        Mn,1    Mn,2    Mn,3    …   Mn,c    …   Mn,60

Population mean:

$$\mu = \frac{1}{60\,n} \sum_{i=1}^{n} \sum_{j=1}^{60} M_{i,j}$$

Sample mean of gene expressions for cancer cell line “c”:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} M_{i,c}$$

Std deviation of gene expressions for cancer cell line “c”, with $X_i = M_{i,c}$:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})^2}$$

Test statistic:

$$t = \frac{(\bar{x} - \mu)\,\sqrt{n}}{s}$$

Genes2Networks

• The transcription factor output for each cancer cell line from ChEA is used as input to Genes2Networks

• Genes2Networks connects lists of transcription factors through other protein intermediates from mammalian protein interaction databases
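The idea of connecting seed proteins through intermediates can be illustrated with a small breadth-first search over an interaction graph. This is a simplified sketch and not the Genes2Networks implementation: it links every pair of seeds by one shortest path of bounded length and reports the intermediate proteins on those paths. The protein names, the `max_edges` cutoff and both function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, start, goal, max_edges):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 == max_edges:
            continue  # do not extend paths beyond the length cutoff
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def connect_seeds(interactions, seeds, max_edges=2):
    """Build a sub-network linking seed proteins through intermediates."""
    graph = {}
    for a, b in interactions:  # undirected protein-protein interactions
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    edges = set()
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(graph, a, b, max_edges)
        if path:
            edges.update(tuple(sorted(pair)) for pair in zip(path, path[1:]))
    nodes = {n for edge in edges for n in edge}
    return nodes - set(seeds), edges


# Toy interaction list: TF1 and TF2 share the intermediate P1;
# TF3 is more than two steps away from the other seeds.
interactions = [
    ("TF1", "P1"), ("P1", "TF2"),
    ("TF2", "P2"), ("P2", "P3"), ("P3", "TF3"),
]
intermediates, edges = connect_seeds(interactions, {"TF1", "TF2", "TF3"})
print(sorted(intermediates))
print(sorted(edges))
```

With a cutoff of two edges, only seeds that share a direct interaction or a single common partner end up in the same sub-network; raising the cutoff trades specificity for connectivity.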

KEA

• The unique protein sub-networks output by Genes2Networks can then be submitted to KEA, which uses Fisher’s Exact Test to identify the protein kinases most likely regulating the proteins in each sub-network

• At this stage, the top regulating transcription factors, protein sub-networks and kinases have been identified for each of the NCI-60 cancer cell lines

• An integrated matrix can now be created to compare the data holistically by displaying the top regulating elements and their putative effects on the different cell lines
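One simple way to lay out such an integrated matrix is a presence/absence table with regulators as rows and cell lines as columns. The sketch below assumes that the top-ranked regulators for each cell line are already available as plain lists; the binary encoding, the function name and the labels are illustrative assumptions, not the format used for the poster.

```python
def build_integrated_matrix(top_regulators):
    """Binary matrix: rows are regulators, columns are cell lines.

    top_regulators maps a cell line name to the names of its top-ranked
    regulators (transcription factors or kinases).
    """
    cell_lines = sorted(top_regulators)
    regulators = sorted({r for regs in top_regulators.values() for r in regs})
    matrix = [
        [1 if reg in top_regulators[line] else 0 for line in cell_lines]
        for reg in regulators
    ]
    return regulators, cell_lines, matrix


# Hypothetical top regulators for three cell lines.
top_regulators = {
    "LINE_A": ["TF1", "KINASE1"],
    "LINE_B": ["TF1"],
    "LINE_C": ["TF2", "KINASE1"],
}
regulators, cell_lines, matrix = build_integrated_matrix(top_regulators)
print("\t".join(["regulator"] + cell_lines))
for reg, row in zip(regulators, matrix):
    print("\t".join([reg] + [str(v) for v in row]))
```

Replacing the 0/1 entries with enrichment p-values or ranks would keep the same layout while preserving how strongly each regulator is associated with each cell line.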

Example of Process

[Workflow diagram. Box labels:]

• Microarray

• Analyze mRNA profile from NCI-60 database by using statistical techniques to compute over/under expressed genes

• Identify protein sub-networks that “connect” the transcription factors through additional proteins

• Top ranked protein kinases most likely regulating the protein sub-networks

• Wet lab experimental validation (future research)

PLD1 SART1 WARS DNAJB1 M6PR GPNMB SREBF2 SEC11A

RAE1 GPAA1 SLC29A1 IL13RA1 TRAK2 AKR7A2 USF2 CNOT8

CHPF GAS7 CUL4B SLC6A8 CSNK1E GTF2H1 PCOLCE CHMP2B

PRKCD DHPS SLC37A4 TCTA TIMP2 STX7 SNTA1 PTPN18

CTSF CD302 KAT2B EDNRB NOV HPS5 SMCR7L ACP5

GJB1 AP1S1 ATP7A EIF2S3 DCT DYNC1I1 BCHE GSTT2

NPAS2 SCRG1 TRAIP UBFD1 TYRP1 TRPM2 DDX18 SLC4A3

HBE1 RXRG PLXNB3 KCNS3 S100A3 ASPA SLC22A18AS PPP2R4

TYR GLRX HCG4 GPR143 DGKI CGGBP1 CSRP2 SLC25A11

STAU1 MAGEA1 CTAG2 ZNF200 CAST PDIA6 BEST1 MCM7

HLA-DRA TXN2 TNFRSF14 FAM3A RAB27A SLC1A4 KHDRBS3 ART3

PLP1 HSP90AA1 MAGEA12 CTAG1B NRP2 CAPN3 GYG2 DLAT

SPIN2B C1orf144 SLC25A6 ZCCHC24 SUCLG2 TUBB4 RFNG MORC3

C14orf109 CLCN2 C9orf61 CCPG1 MAGEA2B MAGEA5 AZI1 UAP1L1

DPY19L2P2 DUSP10 SLC6A10P SLC5A4 GK3P SFXN3 HLA-DMA ALDH18A1

BACE2 MUL1 METT11D1 MRPS18A KLF11 ARHGEF3 FAHD2A RINT1

CRIPT MTO1 HEY1 FBXL15 CSGALNACT1 ITIH5 CA14 C14orf139

CADPS2 PRR7 GAL3ST4 NUDT11 CEP97 LONRF3 TINF2 TP53TG3

MGAT4B FAM86C ROPN1B C20orf30 TRIM48 TH1L C5orf54 CDCA3

RPL23AP7 MICALL1 LDLRAP1 C17orf90 LUZP1 C3orf64 GPR177 COL9A1

LOC348926 PCOTH FAM86B1 LPCAT2 SURF4 XPO5 PDXP COPG2

HAGHL TNFSF13B FAM167B SPRYD5 DGAT2 C2orf30 C6orf89 UBL7

ULK3 TOMM40L FAM160B1 SNX30 TMEM55A GGT7 C12orf34 C3orf38

HDAC10 LOC400657 AFAP1L1 FAM125A OLIG1 C11orf82 ENHO CITED4

HSD11B1L SCARNA15 SMYD4 LOC153364 CAMK2N2 PAGE2 LOC730124 GBGT1

CHRM1 AARS2 ANKRD54 KIAA1524 KIAA1586 ZC3H12C FSTL5 CLEC2L

GPR158 SLITRK4 GNASAS ELOVL3 ST6GALNAC3 LRRC33 NPHP3 HMCN1

RNF175 C5orf35 LOC147645 LOC730259 TMEM171 DLX1

Top 222 over expressed genes for cancer cell line MDA_N (melanoma)

With gene input, ChEA identified the top ranked transcription factors

Genes2Networks output of protein sub-networks when top 10 transcription factors from ChEA were given as an input

Top ranked kinase proteins identified from KEA