Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.
Chris Stoeckert, Ph.D.
Center for Bioinformatics & Dept. of Genetics
University of Pennsylvania School of Medicine
Nov. 15, 2005
University of Nebraska Medical Center
What is the code for determining where (and when) a gene is expressed?
http://molbio.info.nih.gov/molbio/gcode.html
[Figure: genes whose expression is driven by different combinations of TFBS1, TFBS2, TFBS3 and TFBS4]
TFBS = transcription factor binding site
Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or
CRMs) that Specify Tissue Expression
From Wasserman & Sandelin, NRG 2004
A Genomics Unified Schema approach to understanding
gene expression
Jennifer Dommer, Steve Fischer, Thomas Gan, Greg Grant, John Iodice, Junmin Liu, Elisabetta Manduchi, Joan
Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel
Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics
Cryptosporidium Database
Beta Cell Biology Consortium
Plasmodium Genome Resource
Phytophthora Soybean EST Database
Flora Centromere Database
GUS is an open source project
Sanger Institute
U. Georgia
U. Chicago
Kansas U.
U. Penn
U. Toronto
Virginia Bioinformatics Institute
GUS Project Goals
• Provide:
  – A platform for broad genomics data integration
  – An infrastructure system for functional genomics
• Support:
  – Websites with advanced query capabilities
  – Research driven queries and mining
GUS Project Resources
• Website -- http://www.gusdb.org
  – News, Documentation, Distributable, GUS-based Projects
GUS Components
• Schema
• Application Framework
– Object/Relational Layer
– Plugin API
– Pipeline API
• Plug-ins
• Web Development Kit (WDK)
Schema | Domain | Features
DoTS | Sequence and annotation | EST clusters, Gene models
RAD | Gene expression | MIAME
Prot | Protein expression | Mass spec, Mzdata/pepXML
Study | Experiments | FuGE
TESS | Gene Regulation | TFBS organization
SRes | Shared resources | Ontologies
Core | Administration | Documentation, Data Provenance
GUS 3.5 Schemas
[Diagram labels: RAD; EST clustering and assembly; DoTS; Genomic alignment and comparative sequence analysis; Identify shared TF binding sites; TESS; BioMaterial annotation; SRes, Study]
DoTS integrates sequence annotation including where expressed
RAD Contains Detailed Expression Experiments Including Tissue Surveys
TESS Allows You to Find Potential TFBS
But there are too many potential sites!
Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy
Jonathan Schug1, Winfried-Paul Schuller2, Claudia Kappen2, J. Michael Salbaum2, Maja Bucan1, Christian
J. Stoeckert Jr.1
1. Center for Bioinformatics, Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska
Genome Biology 2005 6:R33
What is a Liver-Specific Gene?
*http://expression.gnf.org/
Assessing Tissue Specificity of Genes Using Shannon Entropy
Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.
To measure specificity to a particular tissue, we combine entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for a typical tissue under uniform expression.
(a) Very specific liver expression: H=1.6 and Qliver = 2.2, 98612_at cytochrome p450
(b) Near uniform expression : H=4.3 and Qliver=10.2, 104391_s_at Clcn7 chloride channel 7
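A minimal sketch of the two scores, assuming the definitions used in the Genome Biology paper cited above: p_t is the gene's expression in tissue t divided by its total over all tissues, H = -sum(p_t lg p_t), and Q for tissue t is H - lg(p_t). The function names and the toy expression profiles are hypothetical and only illustrate the two limiting cases named on the slide.

```python
import math

def shannon_entropy(expression):
    """Entropy H (in bits) of a gene's relative expression across tissues."""
    total = sum(expression)
    probs = [x / total for x in expression if x > 0]
    return -sum(p * math.log2(p) for p in probs)

def q_score(expression, tissue_index):
    """Q = H - lg(p_t); a small Q means the gene is specific to that tissue."""
    total = sum(expression)
    p_t = expression[tissue_index] / total
    if p_t == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(expression) - math.log2(p_t)

# Toy profiles over T = 8 tissues (made-up numbers, not data from the talk).
T = 8
single_tissue = [0.0] * T
single_tissue[2] = 100.0          # expressed only in tissue 2
uniform = [10.0] * T              # expressed equally everywhere

h_single = shannon_entropy(single_tissue)
q_single = q_score(single_tissue, 2)
h_uniform = shannon_entropy(uniform)
q_uniform = q_score(uniform, 0)
```

For the single-tissue profile both H and Q are 0; for the uniform profile H = lg T and Q = 2 lg T, matching the limits described above.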
Agreement between Microarrays and ESTs on Tissue Specificity
Specificity Characteristics of Tissues
Tissue | Probe Set ID | H | Q | RefSeq | Description
Amygdala | 96055_at | 3.2 | 5.8 | NM_031161 | cholecystokinin
Amygdala | 93178_at | 2.7 | 5.8 | NM_019867 | neuronal guanine nucleotide exchange factor
Amygdala | 93273_at | 3.7 | 5.8 | NM_009221 | synuclein, alpha
Amygdala | 92943_at | 3.5 | 6.0 | NM_008165 | glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala | 95436_at | 3.3 | 6.1 | NM_009215 | somatostatin
Lymph Node | 98406_at | 2.7 | 4.0 | NM_013653 | chemokine (C-C motif) ligand 5
Lymph Node | 98063_at | 1.6 | 4.1 | - | glycosylation dependent cell adhesion molecule 1
Lymph Node | 99446_at | 2.5 | 4.1 | NM_007641 | membrane-spanning 4-domains, subfamily A, member 1
Lymph Node | 92741_g_at | 3.3 | 4.5 | - | immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node | 102940_at | 2.8 | 4.6 | NM_008518 | lymphotoxin B
Liver | 94777_at | 1.3 | 2.1 | - | albumin 1
Liver | 101287_s_at | 1.6 | 2.2 | NM_010005 | cytochrome P450, 2d10
Liver | 99269_g_at | 1.5 | 2.2 | NM_019911 | tryptophan 2,3-dioxygenase
Liver | 100329_at | 1.4 | 2.3 | NM_009246 | serine protease inhibitor 1-4
Liver | 94318_at | 1.6 | 2.3 | NM_013475 | apolipoprotein H
CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression
CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
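The three criteria can be checked on a single candidate window as sketched below. This assumes the usual observed/expected CpG ratio, (number of CG dinucleotides x length) / (number of C x number of G), which the slide does not spell out. The function names are hypothetical, and the default thresholds are copied from the slide as written.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")                      # CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0        # expected CpG count by chance
    obs_exp = cg / expected if expected else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's three thresholds to one candidate window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp

# Synthetic sequences, for illustration only.
rich = "CG" * 100      # 200 bp, all C/G, many CpGs
poor = "AT" * 100      # 200 bp, no C or G
short = "CG" * 50      # only 100 bp
```

A genome-wide scan would slide such a window along the sequence and merge adjacent passing windows; that step is left out here.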
Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions
[Figure panels: CpG+ and CpG- promoters; Multi-Tissue (H >= 4.4) and Tissue-Specific (H <= 3.5)]
Promoters based on DBTSS (http://dbtss.hgc.jp)
TATA Boxes are Associated with Tissue-Specific Genes
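[Bar chart: % with TATAA Box (y-axis, 0 to 90) by Q-Value bin (x-axis: 0-2, 2-4, 4-6, 6-8, 8-10, >10), for human and mouse.]
Genes with TATAA Box overall: human 18.8%, mouse 22.9%.
p-values shown on the chart (h = human, m = mouse): p h = 0.13, p m = 0.15; p h = 0.00007, p m = 0.00087; p h = 0.00005, p m = 0.00001.
Counts shown on the bars, in extraction order: (7/9), (8/9), (4/8), (8/28), (16/80), (3/8), (10/28), (16/80), (4/31), (2/27), (9/35), (3/27).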
Cellular Component
Biological Process
Human Only Mouse Only
extracellular, extracellular space
microsome, vesicular fraction; intermediate filament (cytoskeleton)
CGI-/TATA+
response to stimulus
organismal physiological process, inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction
chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development
cell, cytoplasm, intracellular, mitochondrion
nucleus, ribonucleoprotein complex
CGI+/TATA-
nucleobase, nucleoside, nucleotide and nucleic acid metabolism, intracellular transport, metabolism, protein transport, intracellular protein transport, RNA processing, RNA metabolism, cell cycle, mitotic cell cycle
(integral to) (plasma) membrane
extracellular, extracellular space
CGI-/TATA-
organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus
response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction
complement activation, complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response
Functional relationships of promoter classes based on over-represented GO terms (EASE)
First Clues: TATA Box indicates Tissue Specific;
CpG indicates Wide Spread Expression
Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.
Expanding the Mammalian CArGome
Qiang Sun1, Guang Chen2, Jeffrey W. Streb1, Xiaochun Long1, Yumei Yang1, Christian J. Stoeckert, Jr.2 and
Joseph M. Miano1
1. Cardiovascular Research Institute, University of Rochester School of Medicine, Rochester, New York
2. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania
Genome Research (in press)
Serum Response Factor (SRF) Target Genes
Finding Novel CArG elements
• Expect 1 CArG element about every kb just by chance.
  – CCWWWWWWGG with one mismatch allowed
• Use conservation to reduce false positives.
  – 188 associated with 4362 orthologous genes
  – 116 had orthologous CArGs
  – 10/62 known genes found
  – Repeated with 9169 orthologous genes
    • 489 predictions
    • 32/62 known genes found
• 60 of 83 predictions were experimentally validated
  – Transfection assays
  – Binding assays
  – Knockdown assays
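The pattern search in the first bullet can be sketched as a sliding-window scan, where W is the IUPAC code for A or T. This is only an illustration of matching CCWWWWWWGG with at most one mismatch; the function names and the example sequence are hypothetical, and the conservation filter against the orthologous sequence is not shown.

```python
CONSENSUS = "CCWWWWWWGG"
ALLOWED = {"C": "C", "G": "G", "W": "AT"}   # IUPAC: W = A or T

def mismatches(site):
    """Number of positions where a 10-mer disagrees with the consensus."""
    return sum(base not in ALLOWED[symbol]
               for symbol, base in zip(CONSENSUS, site))

def find_carg(seq, max_mismatches=1):
    """Return (position, site, mismatch count) for every CArG-like 10-mer."""
    seq = seq.upper()
    k = len(CONSENSUS)
    hits = []
    for i in range(len(seq) - k + 1):
        site = seq[i:i + k]
        m = mismatches(site)
        if m <= max_mismatches:
            hits.append((i, site, m))
    return hits

# Made-up promoter fragment containing one perfect CArG box at offset 3.
example = "GGGCCATATAAGGTTT"
hits = find_carg(example)
```

Scanning one strand is enough here because the consensus is its own reverse complement.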
Serum Response Factor (SRF) Target Genes
More Clues: Human-mouse conservation enables
identification of valid CArG elements
CArG elements associated with many cytoskeletal genes suggesting role of SRF in cytoskeletal dynamics.
Using Bounded Collection Grammars to Identify cis-
Regulatory Modules in Tissue Specific Genes
Jonathan Schug
Max Mintz (CIS, U Penn)
Bounded Collection Grammars
Collection production rules for the GR response element in the PEPCK promoter
Rules are evaluated using the receiver operating characteristic (ROC)
Each point is a different parameter setting for a rule applied to training sets. Typically use area under the curve (AUC) to rank rules.
Rules are built by increasing complexity when AUC improves
Reduce the search space by not pursuing unproductive paths, e.g., if (A,B) is not better than A or B, then there is no need to look at (A,B,C), (A,B,D) or (A,B,C,D).
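The pruning idea can be sketched as a level-by-level search. This is a simplified stand-in, not the bounded collection grammar itself: each promoter is reduced to the set of motifs it contains, a rule scores a promoter by how many of its motifs are present (an assumption made for this example), and AUC is computed as the chance that a positive promoter outscores a negative one. All names and the toy data are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def rule_auc(rule, positives, negatives):
    """Score each promoter by the number of rule motifs it contains."""
    pos = [len(rule & promoter) for promoter in positives]
    neg = [len(rule & promoter) for promoter in negatives]
    return auc(pos, neg)

def grow_rules(motifs, positives, negatives, max_size=3):
    """Keep solos with AUC > 0.5, then extend only sets that improve AUC."""
    solos = {}
    for motif in motifs:
        rule = frozenset([motif])
        score = rule_auc(rule, positives, negatives)
        if score > 0.5:
            solos[rule] = score
    kept = dict(solos)
    frontier = dict(solos)
    for _ in range(max_size - 1):
        next_frontier = {}
        for rule, rule_score in frontier.items():
            for solo, solo_score in solos.items():
                candidate = rule | solo
                if solo <= rule or candidate in next_frontier:
                    continue
                score = rule_auc(candidate, positives, negatives)
                # Pruning step: sets that do not beat their parts are dropped
                # and therefore never extended at the next level.
                if score > max(rule_score, solo_score):
                    next_frontier[candidate] = score
        kept.update(next_frontier)
        frontier = next_frontier
    return kept

# Toy training sets: motif content of positive and negative promoters.
positives = [frozenset("AB"), frozenset("AB"), frozenset("ABC")]
negatives = [frozenset("A"), frozenset("B"), frozenset("C"), frozenset()]
kept = grow_rules(["A", "B", "C"], positives, negatives)
```

On this toy data the pair {A, B} improves on both solos and is kept, while {A, C} and {B, C} do not improve and are never extended.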
The 3-set rule for the PEPCK GR element
Note improvements of 2-sets over solos and the 3-set over 2-sets.
Discovering regulatory modules by creating profiles for Gene
Ontology Biological Processes based on tissue-specificity scores
Elisabetta Manduchi, Jonathan Schug
Klaus Kaestner (Genetics, U Penn)
If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?
[Diagram: Tissue, Biological Process, Genes]
For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q.
• To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set.
• The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
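The equivalence in the last bullet can be illustrated with a running-sum sketch. It assumes the Mootha-style step sizes, +sqrt((N-G)/G) for a gene in the set and -sqrt(G/(N-G)) otherwise, and a one-sided maximum; the exact variant used in this work is not given on the slide. The function name and the toy ranking are hypothetical.

```python
import math

def running_sum_scores(ranked_genes, gene_set):
    """Return (ES, KS): maxima of two running sums over a ranked gene list.

    ES uses Mootha-style steps; KS uses steps of 1/G and 1/(N-G), i.e. the
    difference between the two empirical cumulative distributions.
    """
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        raise ValueError("gene set must be a proper, non-empty subset")
    up = math.sqrt((n - g) / g)
    down = math.sqrt(g / (n - g))
    es = ks = 0.0
    run_es = run_ks = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            run_es += up
            run_ks += 1.0 / g
        else:
            run_es -= down
            run_ks -= 1.0 / (n - g)
        es = max(es, run_es)
        ks = max(ks, run_ks)
    return es, ks

# Toy ranking by increasing Q in one tissue; three genes belong to the GO BP.
ranked = ["g%d" % i for i in range(1, 11)]
members = {"g1", "g2", "g4"}
es, ks = running_sum_scores(ranked, members)
```

With N genes in the list and G in the set, both step sizes differ from the KS steps by the same factor sqrt(G(N-G)), so ES = sqrt(G(N-G)) x KS, which is the "equal up to a multiplicative constant" statement.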
[Figure: ranked gene lists for liver, muscle, brain, ...; genes marked * belong to the steroid metabolism gene set]
• Have applied to two different Affymetrix-based datasets
  – Shmueli et al. C R Biol 2003. GeneNote (human)
  – Su et al. PNAS 2004. GEA2 (human and mouse)
• We looked at ~ 2000 GO BPs that we could map to probe sets
Application to Tissue Surveys
GO BPs having significantly specific profiles for each tissue can be identified
[Figure panels: significant in liver; significant in heart and skeletal muscle]
[Flow diagram, one branch per species: Mouse Tissue Survey and Human Tissue Survey; Tissue-specific GO BPs; Reduce and Intersect; Training Set of Promoter Sequences; Ortholog Pairs (Homologene)]
Learning Tissue-Specific Promoter Motifs
[Flow diagram: Liver-specific Steroid metabolism (GEA2); 32 POS, 365 NEG; UCSC conserved sequences; Mm-based consensus conserved sequences and Hs-based consensus conserved sequences; Positive Solos (ROC area > 0.5) for each]
Common solos (31)
Mm-based collections (30)
Hs-based collections (83)
Common collections (13)
Common solos: GATA, MYCMAC/USF, AIRE, CAAT, ER-LEFT, TCF11, TTAC/EFC/NCX/VBP, PBX, INI, AATC, SREBP, DBP, Forkhead, E4B, EBOX, CCAA, CREB/ATF, S8/CART1/CHX10/NKX25, G_AA/CEBP/HLF, TAACC, LXR, HNF1, ALX4, HNF4/TCF4/COUP/PPAR, PPAR-LEFT, ROAZ, AML/PEBP, BACH/NFE2/NRF2, ZTA, P53, GNCF/SF1
Legend: Liver-specific from Krivan and Wasserman (2001); Known Liver TFs
Learning Liver Specific CRMs for Steroid Metabolism
• AIRE
• P53
• ER-LEFT
• {CREB/ATF, GATA}
• {GATA, GNCF/SF1}
• {Forkhead, GATA}
• {GATA, G_AA/CEBP/HLF}
• {GATA, SREBP}
• {GATA, TAACC}
• {GATA, ZTA}
• {AATC, SREBP}
• {CAAT, SREBP}
• {BACH/NFE2/NRF2, G_AA/CEBP/HLF}
Without imposing prior knowledge, we end up with rules that are highly enriched for TFs expected to play a role in liver-specific steroid metabolism.
Testing a learned CRM for liver steroid metabolism using a liver HNF3-beta (FoxA2)
knock out mouse study
• {Forkhead, GATA} set rule applies to HNF3-beta/FoxA2
• Search promoters of genes down-regulated in liver as measured on PancChip microarray
  – Pancreas-focused array with 7356 known genes.
    • 52 (0.7%) map to steroid metabolism.
  – 71 genes down-regulated
    • 7 (10%) map to steroid metabolism.
Genes down-regulated by knockout of a forkhead protein (Hnf3-beta) are significantly enriched in steroid metabolism
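One way to check this enrichment is a hypergeometric tail probability. The sketch below assumes that the 71 down-regulated genes are a subset of the 7356 known genes on the array and that a one-sided hypergeometric test is appropriate; the talk does not say which test was used. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` genes without replacement
    from `population` genes, of which `successes` carry the annotation."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total

# Numbers from the slide: 7356 genes on the array, 52 in steroid metabolism,
# 71 down-regulated, 7 of those in steroid metabolism.
expected_by_chance = 71 * 52 / 7356
p_value = hypergeom_tail(7356, 52, 71, 7)
```

About 0.5 steroid metabolism genes would be expected among 71 genes by chance, against the 7 observed, so the tail probability is very small.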
More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues
Tested a candidate CRM for liver steroid metabolism with a knock-out mouse. Support for role of one of the factors but not enough sensitivity for seeing both factors.
Future Directions
• Apply learning methods to many tissues and processes incorporating multiple surveys
• Add novel motifs to learning process
• Use ChIP and tissue-focused expression datasets to better evaluate
Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".
http://www.cbil.upenn.edu