Transcriptional and post-transcriptional regulation of gene expression
[Figure: DNA (activation, repression; Pol II) → mRNA (translation, localization, stability; 3’UTR) → protein]
• Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes?
• What is the cis-regulatory code of each factor? Does it require any co-factors?
ChIP-seq
[Figure: ChIP-seq workflow. An antibody pulls down the transcription factor of interest together with its bound DNA; input DNA serves as the control. Fragments have an average length of ~250 bp, and 25-40 bp of each fragment are sequenced on a Genome Analyzer II (Solexa).]
BCL6 ChIP-seq
• Lymphoma cell line (OCI-Ly1)
• Solexa/Illumina
• 6 lanes for ChIP, 1 for input DNA, 1 for QC
• 36 nt long sequences
• 32 million reads
• Aligned/mapped to hg18 with Eland
Melnick lab at WCMC
Read mapping with Eland
Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Solexa read, exact match: AAAATACGCGTATTCTCCCAAAACAATATC
Solexa read, 1 mismatch: AAAATACGCCTATTCTCCCAAAACAATATC
Solexa read, 2 mismatches: AAAATACGCCTATTCTCCCATAACAATATC
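The three reads on this slide can be checked against the reference fragment with a naive scan. This is only an illustrative sketch: it is not Eland's actual algorithm, and the function names `hamming` and `map_read` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(read, reference, max_mismatches=2):
    """Return (0-based position, mismatches) for every reference window
    that matches the read with at most max_mismatches mismatches."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        d = hamming(read, window)
        if d <= max_mismatches:
            hits.append((pos, d))
    return hits


# Reference fragment and reads copied from the slide.
reference = ("AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATC"
             "TTACAAGATGTAAATATACCCAAGATG")
reads = [
    "AAAATACGCGTATTCTCCCAAAACAATATC",
    "AAAATACGCCTATTCTCCCAAAACAATATC",
    "AAAATACGCCTATTCTCCCATAACAATATC",
]
for read in reads:
    print(read, map_read(read, reference))
```

A scan like this costs time proportional to read length times reference length, so it is only practical for toy examples; real aligners index the genome or the reads.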
Reads can map to multiple locations/chromosomes
Reads map to one strand or the other
[Figures: Solexa reads 1 and 2 aligned to the reference human genome (hg18)]
>HWI-EAS83_30UCEAAXX:1:2:915:1011 AGGTCACAAAACAAGTCCTAACAAATTTAAGAGTAT U0 1 13 62 chr8.fa 59699745 R DD
>HWI-EAS83_30UCEAAXX:1:2:826:1245 GTCAGAAAAATCCTTTTTATTATATAAACAATACAT U2 0 0 1 chr5.fa 121195098 F DD 15G 20G
>HWI-EAS83_30UCEAAXX:1:2:900:945 GTCATCAAACTCCAAGGATTCTGTTTTCAACATACT U0 1 1 0 chr18.fa 8914049 R DD
>HWI-EAS83_30UCEAAXX:1:2:1037:1118 GAAAGTGATTAGCAGATTGTCATTTAATAATTGTCT U2 0 0 1 chr1.fa 97496963 F DD 18G 28G
>HWI-EAS83_30UCEAAXX:1:2:898:874 GATAAATTTTTTCCTACAATCTTAAATTATTACACA U1 0 1 0 chr3.fa 95643444 R DD 10C
>HWI-EAS83_30UCEAAXX:1:2:918:928 AAAAATTAAACAATTCTAAAAATATTTTTATCTTAA U2 0 0 1 chr2.fa 177727639 R DD 18C 31G
>HWI-EAS83_30UCEAAXX:1:2:1324:4 GCACATGTCATACTCTTTCTAGCTCTCTTATTTTTC U0 1 0 0 chr8.fa 79132719 R DD
>HWI-EAS83_30UCEAAXX:1:2:899:1015 AAATTAATGTAAAAAATAGGATACTGAATTGTGATA U1 0 1 0 chr10.fa 69774166 F DD 30G
>HWI-EAS83_30UCEAAXX:1:2:909:926 GTAGTTAACAATAATTTATTTTATACTTCAAAATTC U1 0 1 17 chrX.fa 26496842 R DD 7A
>HWI-EAS83_30UCEAAXX:1:2:701:1702 GTCAGAATTAATTAATCAAAACACCAAATGTACTTC U0 1 0 0 chr12.fa 72700465 F DD
>HWI-EAS83_30UCEAAXX:1:2:996:1003 ATTTTGACTTTATTATTTTTTCTTCAATGTTTTTAA NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:884:1090 GAAAGTACATCAAATACATATTATATACTTTACATA R2 0 0 2
>HWI-EAS83_30UCEAAXX:1:2:911:937 AATCCATATACATTTCTTTTTAATCATTTCCTCTTT U1 0 1 0 chr11.fa 94204222 F DD 20G
>HWI-EAS83_30UCEAAXX:1:2:1517:330 GTGAGTTTCTTAATCCTGAGTTCTAATTTTATTTCA R0 29 255 255
>HWI-EAS83_30UCEAAXX:1:2:904:1031 ACATTTTATAAATTTTTAATTTCATTTTAATTTATA NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:1291:1469 GTTTTTAAAATCAACACTTTTATTATAGAAGTAGCA U0 1 0 1 chr12.fa 62166701 R DD
>HWI-EAS83_30UCEAAXX:1:2:1697:828 GTACTGATGTAAACTTGGTAAAAACATTGACATAAA U0 1 0 0 chr14.fa 65160857 F DD
>HWI-EAS83_30UCEAAXX:1:2:1415:583 GAAGAAAATGACTATGTCAAAATATTATCTCTCAAT U0 1 0 0 chr5.fa 97782464 F DD
>HWI-EAS83_30UCEAAXX:1:2:1561:1653 GTTTTACTGATTTTCTTACTTACTAAACTACCTGTT U0 1 0 0 chr7.fa 133200265 F DD
>HWI-EAS83_30UCEAAXX:1:2:1579:943 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTA NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:1705:268 GAGAATTATTCAGAAGTCAAATCTGTGCTTAGTTTA U2 0 0 1 chr5.fa 162472124 R DD 3G 7C
>HWI-EAS83_30UCEAAXX:1:2:1489:318 GTATGTATCATATATATTTATGTATCATATATATTT R1 0 3 2
>HWI-EAS83_30UCEAAXX:1:2:1003:1113 GATTGCTCCATTATTTGTTAAAAACATAGTAAAATA NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:895:1072 ATGAGATCAGTACTTCAAAGAGATATCTGCACTCCC U0 1 1 9 chr12.fa 33830898 R DD
>HWI-EAS83_30UCEAAXX:1:2:853:1178 GTTAGTCCCAATATTCCATTAATCCCAATAAATATA U2 0 0 1 chr6.fa 110722427 F DD 15G 19G
>HWI-EAS83_30UCEAAXX:1:2:1432:972 GAGATAATAATAGCAGTTATGGCATCGAGATAATTT U0 1 0 0 chr2.fa 47305609 R DD
>HWI-EAS83_30UCEAAXX:1:2:1718:341 GTAGAGGGCACACATCACAAACAAGTTTCTGAGAAT R2 0 0 3
>HWI-EAS83_30UCEAAXX:1:2:1171:302 GAATATCCACTTGCAGACTTTACAAACAAATTTTTT R2 0 0 4
>HWI-EAS83_30UCEAAXX:1:2:1055:1126 GGCAGATGAAACTTCTATACACTATATTTTAGCCAG U0 1 0 0 chr13.fa 90021137 F DD
>HWI-EAS83_30UCEAAXX:1:2:971:1371 GAAAGAAAAACTATTGAAAAAATAGTTACTTTCCAA U0 1 0 0 chr1.fa 74303257 R DD
>HWI-EAS83_30UCEAAXX:1:2:1774:614 GTGTAGATGATATCGAGGGCATTAGAAGTAAATAGC U0 1 0 0 chr5.fa 16031200 F DD
>HWI-EAS83_30UCEAAXX:1:2:1207:808 GAGAGGAAATAATAAAGATAAAAGTAGAAAAAGTGA U0 1 0 0 chr1.fa 187326417 F DD
>HWI-EAS83_30UCEAAXX:1:2:1680:815 GATAATTATGTTGTTGTAATTATTGTTTGTTTTTTT U0 1 0 0 chr15.fa 46739015 R DD
>HWI-EAS83_30UCEAAXX:1:2:1688:260 GTTGACAATCCAGCTGTCATAGAAACTGACTATTTT U0 1 0 0 chr12.fa 38910133 R DD
>HWI-EAS83_30UCEAAXX:1:2:1051:916 AAAAATTCTCCCAAAACAACAAGATGTAAATATACC U0 1 0 0 chr3.fa 101625712 R DD
>HWI-EAS83_30UCEAAXX:1:2:1771:308 GTTCTTACACTGATATGAAGAAATACCTGAGACTGG U0 1 2 67 chr2.fa 214128537 R DD
>HWI-EAS83_30UCEAAXX:1:2:911:917 GAGAAACACACATATTTTTGTAAGTGCCATCACATC U1 0 1 0 chr7.fa 13668652 R DD 18C
>HWI-EAS83_30UCEAAXX:1:2:1105:348 GTATTATCTAACACACAAGATGATGTTTGTTTTTAT NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:1048:857 GAGTGTAGAAAATTTTCTGCCCTAAAATATTTGTTA U1 0 1 0 chr6.fa 74625385 F DD 13G
>HWI-EAS83_30UCEAAXX:1:2:743:1729 GTATCCTAAAGTGTATCTTATGTTTTTTCATCTTCT U1 0 1 0 chr12.fa 7400023 R DD 9C
>HWI-EAS83_30UCEAAXX:1:2:1287:64 AATAAAACAAATTCCAATGGCTTAGATTCTACTTAA U2 0 0 1 chr10.fa 98020799 R DD 15C 20C
>HWI-EAS83_30UCEAAXX:1:2:940:1059 AAATGGTCATACTTCCCAAAGCGATCTACAGATTCA U1 0 1 29 chr3.fa 50834510 R DD 19C
>HWI-EAS83_30UCEAAXX:1:2:898:1061 ACATTTCCACATTTCTGTGGAAGCCTCACAATCATT R2 0 0 2
>HWI-EAS83_30UCEAAXX:1:2:913:932 ATTAATCAACAGCAACATTAATCAACTGAATCAACA U0 1 0 0 chr2.fa 46078825 R DD
>HWI-EAS83_30UCEAAXX:1:2:43:1647 GAATAAATAATCAAAACATATAATACATTTTTTTAT U1 0 1 0 chr5.fa 41496935 F DD 32G
>HWI-EAS83_30UCEAAXX:1:2:1412:731 ATATACACATATATATACATATATATATACACATAT R0 47 255 255
>HWI-EAS83_30UCEAAXX:1:2:1389:1196 GAGAAGGAAATGTGTTTTCTAAGTTTCTTTATCTTC U1 0 1 0 chr4.fa 188020201 F DD 32G
>HWI-EAS83_30UCEAAXX:1:2:1264:1479 GTGTAGGAAAGAAAAAAGGAGGTTGTGTAGAAAAGA U0 1 0 0 chr2.fa 192227804 F DD
>HWI-EAS83_30UCEAAXX:1:2:38:890 TTTATTTAAATCTTTTAAAAANTTTTTTCCAACAAA NM 0 0 0
>HWI-EAS83_30UCEAAXX:1:2:1341:1065 GATACATATACACAAAGTAAAACTATTCAGCCTCTA U0 1 0 0 chr17.fa 51416321 F DD
>HWI-EAS83_30UCEAAXX:1:2:1132:929 GAGTTGTATTAATCTTAAATTGATAATTTACCATAT U1 0 1 0 chr10.fa 2376138 F DD 24G
>HWI-EAS83_30UCEAAXX:1:2:1758:275 GCATTTTAACAAAATCACCATATCTGGGTAACCATT U1 0 1 0 chr21.fa 27648337 R DD 18C
>HWI-EAS83_30UCEAAXX:1:2:914:1000 GAAAGCACTTTATAATAAAACAACATTGGAGCACCT U1 0 1 0 chr8.fa 67496303 F DD 16G
Number of reads per Eland type
U0 21019702 65%
U1 3280059 10%
U2 1007173 3%
R0 3661054 11%
R1 815275 2%
R2 406002 1%
NM 2050499 6%
QC 306352 1%
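A tally like the one above can be produced by counting the match-type field of each Eland record. The sketch below assumes that field is the third whitespace-separated column, as in the sample output shown earlier, and uses the common reading of the codes (U = unique match with 0, 1 or 2 mismatches; R = multiple matches; NM = no match; QC = failed quality control). The helper name `eland_type` is hypothetical.

```python
from collections import Counter


def eland_type(line):
    """Return the match-type code of one Eland record.

    Assumption: the record is whitespace-separated and the code
    (U0, U1, U2, R0, R1, R2, NM or QC) is the third field.
    """
    return line.split()[2]


# A few records copied from the sample output.
records = [
    ">HWI-EAS83_30UCEAAXX:1:2:915:1011 AGGTCACAAAACAAGTCCTAACAAATTTAAGAGTAT U0 1 13 62 chr8.fa 59699745 R DD",
    ">HWI-EAS83_30UCEAAXX:1:2:826:1245 GTCAGAAAAATCCTTTTTATTATATAAACAATACAT U2 0 0 1 chr5.fa 121195098 F DD 15G 20G",
    ">HWI-EAS83_30UCEAAXX:1:2:900:945 GTCATCAAACTCCAAGGATTCTGTTTTCAACATACT U0 1 1 0 chr18.fa 8914049 R DD",
    ">HWI-EAS83_30UCEAAXX:1:2:996:1003 ATTTTGACTTTATTATTTTTTCTTCAATGTTTTTAA NM 0 0 0",
    ">HWI-EAS83_30UCEAAXX:1:2:1517:330 GTGAGTTTCTTAATCCTGAGTTCTAATTTTATTTCA R0 29 255 255",
]

counts = Counter(eland_type(r) for r in records)
total = sum(counts.values())
for code, n in counts.most_common():
    print(f"{code}\t{n}\t{100 * n / total:.0f}%")
```

Only the U0 reads (unique, no mismatch) are used in the hands-on part of this tutorial, which is one reason to look at this breakdown before peak calling.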
Peak detection
• Calculate the read count at each position (bp) in the genome
• Determine if the read count is greater than expected
• We need to correct for input DNA reads (control):
  - non-uniformly distributed (they form peaks too)
  - vastly different numbers of reads between ChIP and input
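The first bullet, a read count at every base pair, can be sketched as follows. This is a minimal illustration, not ChIPseeqer's code: the function name `coverage`, the 0-based coordinates, and the convention that each read is given by its 5' end and extended toward the fragment interior are all assumptions.

```python
def coverage(read_starts, chrom_len, frag_len=250):
    """Per-base read counts after extending each read to frag_len.

    read_starts: iterable of (pos, strand) with pos the 0-based 5' end
    of the read and strand '+' or '-' (assumed convention).
    """
    diff = [0] * (chrom_len + 1)          # difference array
    for pos, strand in read_starts:
        if strand == "+":
            start, end = pos, pos + frag_len        # extend rightwards
        else:
            start, end = pos - frag_len + 1, pos + 1  # extend leftwards
        start = max(start, 0)
        end = min(end, chrom_len)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    counts, running = [], 0
    for d in diff[:chrom_len]:
        running += d                      # prefix sum gives the coverage
        counts.append(running)
    return counts


print(coverage([(2, "+"), (4, "+")], chrom_len=10, frag_len=3))
```

The difference array keeps the cost at one update per read plus one pass over the chromosome, instead of touching every covered base for every read.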
Peak detection using ChIPseeqer
[Figure: read count and expected read count along the genome]
Expected read count = total number of reads * extended fragment length / chr length
Is the observed read count at a given genomic position greater than expected?
The Poisson distribution:
$$P(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda^{k} e^{-\lambda}}{k!}$$
x = observed read count, λ = expected read count
[Figure: Poisson distribution; x-axis: read count, y-axis: frequency]
The Poisson distribution, worked example:
x = 10 reads (observed), λ = 0.5 reads (expected)
P(X ≥ 10) = 1.7 × 10^-10
log10 P(X ≥ 10) = -9.77
-log10 P(X ≥ 10) = 9.77
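The worked example can be reproduced in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the read total and chromosome length in the first call are made-up numbers, and the tail is summed directly from k = x upward (mathematically the same as 1 minus the lower sum, but less prone to rounding when p is tiny).

```python
import math


def expected_read_count(total_reads, frag_len, chrom_len):
    """lambda = total number of reads * extended fragment length / chr length."""
    return total_reads * frag_len / chrom_len


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), lam > 0, summing terms from k = x upward."""
    if x <= 0:
        return 1.0
    # First term: lam^x * exp(-lam) / x!, computed in log space.
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    k = x
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k     # next Poisson term from the previous one
    return total


# Hypothetical numbers, only to show the formula for lambda.
print(expected_read_count(total_reads=1000, frag_len=250, chrom_len=1_000_000))

# Values from the slide: x = 10 observed reads, lambda = 0.5 expected reads.
p = poisson_upper_tail(10, 0.5)
print(p, -math.log10(p))
```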
[Figure: read count, expected read count, and -log10(p) tracks along the genome, shown for ChIP reads and for input reads, with a threshold line; x-axis: genome positions (bp)]
Expected read count = total number of reads * extended frag len / chr len
$$P_c(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda_c^{k} e^{-\lambda_c}}{k!} \quad \text{(ChIP)}$$
$$P_i(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda_i^{k} e^{-\lambda_i}}{k!} \quad \text{(input)}$$
Normalized peak score (at each bp):
$$R = -\log_{10} \frac{P(X_{\mathrm{ChIP}})}{P(X_{\mathrm{input}})} = -\left(\log_{10} P_c - \log_{10} P_i\right)$$
• Will detect peaks with high read counts in ChIP, low in input
• Works when there is no input DNA!
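A minimal sketch of the normalized score at one position follows. The function names are hypothetical, and treating a missing input as P_i = 1 (so that R reduces to -log10 P_c) is an assumption made here to illustrate the "no input DNA" case; the slides do not spell out how ChIPseeqer handles it.

```python
import math


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), lam > 0, summed from k = x upward."""
    if x <= 0:
        return 1.0
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    k = x
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k
    return total


def peak_score(chip_count, chip_lambda, input_count=None, input_lambda=None):
    """R = -(log10 P_c - log10 P_i) at one genomic position.

    Assumption: with no input sample, P_i is taken as 1 (log10 P_i = 0).
    """
    log_pc = math.log10(poisson_upper_tail(chip_count, chip_lambda))
    if input_count is None:
        log_pi = 0.0
    else:
        log_pi = math.log10(poisson_upper_tail(input_count, input_lambda))
    return -(log_pc - log_pi)


print(peak_score(10, 0.5))            # ChIP only
print(peak_score(10, 0.5, 10, 0.5))   # same enrichment in input: score 0
```

A position that is equally enriched in ChIP and input scores zero, which is how the score discounts input peaks.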
Non-mappable fraction of the genome
We enumerated all 30-mers, counted the number of occurrences of each, and calculated the non-unique fraction of the genome.
Chromosome, non-unique bp / chromosome length, fraction:
• chr18 9369067/76117153 0.123087459668913 (=12%)
• chr2 33849240/242951149 0.139325292921335
• chr3 27854877/199501827 0.139622164963933
• chr4 27090014/191273063 0.141630052737745
• chr6 24330283/170899992 0.142365618132972
• chr8 20932821/146274826 0.143106107677065
• chr5 26029902/180857866 0.143924633059643
• chr12 19382853/132349534 0.14645199279659
• chr11 20039443/134452384 0.149044906485258
• chr20 10017788/62435964 0.160449000194824
• chr7 26182588/158821424 0.164855517225434
• chr10 22968951/135374737 0.169669404417753
• chr17 14496284/78774742 0.184021980040252
• chrX 31269270/154913754 0.201849540099583
• chr1 55186693/247249719 0.223202247602959
• chr13 28668063/114142980 0.251159230291692
• chr16 23552340/88827254 0.265147676410215
• chr14 29689825/106368585 0.279122120502026
• chrM 4628/16571 0.279283084907368
• chr9 43125838/140273252 0.307441635415995
• chr19 20251255/63811651 0.317359834491667
• chr15 31877970/100338915 0.317702957023205
• chr21 16867677/46944323 0.359312392256674
• chr22 21176578/49691432 0.426161556382597
• chrY 43209644/57772954 0.747921665906161 (=74%)
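The k-mer enumeration can be illustrated on a toy sequence. This is a simplified sketch: the function name is hypothetical, only one strand is considered, the denominator is the number of k-mer start positions rather than the chromosome length, and an in-memory counter like this would not scale to a whole genome.

```python
from collections import Counter


def non_unique_fraction(seq, k=30):
    """Fraction of k-mer start positions whose k-mer occurs more than once in seq."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)


# Toy example with k = 4: ACGT occurs twice, so 2 of the 7 positions are non-unique.
print(non_unique_fraction("ACGTACGTTT", k=4))
```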
Peak detection
• Determine all genomic regions with R >= 15
• Merge peaks separated by less than 100 bp
• Output all peaks with length >= 100 bp
• Process 23M reads in < 7 mins
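The three steps above (threshold, merge, length filter) can be sketched on a per-base score array. The function name `call_peaks`, the half-open (start, end) intervals, and measuring the gap as the number of below-threshold bases between two regions are assumptions for illustration, not ChIPseeqer's implementation.

```python
def call_peaks(scores, threshold=15, max_gap=100, min_len=100):
    """Return (start, end) half-open peak intervals from per-bp scores R."""
    # 1. Contiguous regions with R >= threshold.
    regions = []
    start = None
    for pos, r in enumerate(scores):
        if r >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(scores)))

    # 2. Merge regions separated by fewer than max_gap bases.
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # 3. Keep peaks that are at least min_len long.
    return [(s, e) for s, e in merged if e - s >= min_len]


# Toy example with small parameters so the behaviour is easy to follow.
scores = [0, 20, 20, 0, 20, 20, 20, 0, 0, 0, 0, 20]
print(call_peaks(scores, threshold=15, max_gap=2, min_len=3))
```

In the toy run the first two regions are one base apart and are merged, while the isolated single-base region at the end is dropped by the length filter.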
[Figure: ChIP reads, input reads, and detected peaks along the genome]
BCL6: 18,814 peaks
80% are within 20 kb of a known gene
• Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes?
• What is the cis-regulatory code of each factor? Does it require any co-factors?
Regulatory Sequence Discovery using FIRE
Discovering regulatory sequences associated with peak regions
[Figure: target regions (true TF binding peak? Yes) versus random regions (No); 2x2 table of motif (absent / present) against true TF peak (no / yes)]
Motif correlation is quantified using the mutual information:
$$I(\mathrm{motif}; \mathrm{groups}) = \sum_{i=1}^{2} \sum_{j=1}^{2} P(i,j) \log \frac{P(i,j)}{P(i)\,P(j)}$$
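The mutual information of a 2x2 table (motif absent/present against random/target region) can be computed directly from counts. In this sketch the function name is hypothetical and the logarithm is taken in base 2 (bits), which is an assumption since the slides do not state the base.

```python
import math


def mutual_information(table):
    """Mutual information (in bits) of a contingency table of counts.

    table[i][j] = number of sequences with motif state i
    (0 = absent, 1 = present) in group j (0 = random, 1 = target).
    """
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n == 0:
                continue                      # 0 * log(0) is taken as 0
            p_ij = n / total
            p_i = row_tot[i] / total
            p_j = col_tot[j] / total
            mi += p_ij * math.log2(p_ij / (p_i * p_j))
    return mi


print(mutual_information([[50, 50], [50, 50]]))   # motif unrelated to group
print(mutual_information([[100, 0], [0, 100]]))   # motif perfectly predicts group
```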
Motif Search Algorithm
k-mer MI
Highly informative:
CTCATCG 0.0618
TCATCGC 0.0485
AAAATTT 0.0438
GATGAGC 0.0434
AAAAATT 0.0383
ATGAGCT 0.0334
TTGCCAC 0.0322
TGCCACC 0.0298
ATCTCAT 0.0265
...
ACGCGCG 0.0018
CGACGCG 0.0012
TACGCTA 0.0011
ACCCCCT 0.0010
CCACGGC 0.0009
TTCAAAA 0.0005
AGACGCG 0.0004
CGAGAGC 0.0003
CTTATTA 0.0002
Not informative
[Figure: motifs with MI=0.081, MI=0.045, MI=0.040]
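A ranking of k-mers by mutual information, like the table above, can be sketched as follows. This is not FIRE's implementation: the function names and toy sequences are made up, a k-mer counts as present if it occurs at least once in a sequence, reverse complements are ignored, and MI is measured in bits.

```python
import math
from collections import Counter


def mutual_information(table):
    """Mutual information (bits) of a 2x2 table of counts."""
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:
                mi += (n / total) * math.log2(n * total / (row_tot[i] * col_tot[j]))
    return mi


def presence_counts(seqs, k):
    """Number of sequences in which each k-mer occurs at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def rank_kmers(targets, randoms, k=7):
    """Return (k-mer, MI) pairs sorted from most to least informative."""
    in_t = presence_counts(targets, k)
    in_r = presence_counts(randoms, k)
    scores = {}
    for kmer in set(in_t) | set(in_r):
        t, r = in_t[kmer], in_r[kmer]
        # Rows: motif absent / present; columns: random / target.
        table = [[len(randoms) - r, len(targets) - t], [r, t]]
        scores[kmer] = mutual_information(table)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: CTCATCG is planted in every target and in no random sequence.
targets = ["AACTCATCGAA", "TTCTCATCGTT", "GGCTCATCGCC"]
randoms = ["AAAAAAAAAAA", "TTTTTTTTTTT", "GGGGGGGGGGG"]
ranking = rank_kmers(targets, randoms)
print(ranking[0])
```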
Optimizing k-mers into more informative degenerate motifs
ATCCGTACA → ATCC[C/G]TACA
Which character increases the mutual information by the largest amount?
• Candidates in the first example: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T
• Candidates in the next example (starting from ATCC[C/G]TACA): A/C, T/C, C/G, A/C/G, A/T/C, C/G/T
• ...
[Figure: target regions (true TF binding peak? Yes) versus random regions (No)]
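One optimization step, widening a single motif position to the base set that raises the mutual information the most, can be sketched as below. This is an illustration under assumptions, not FIRE's code: function names and toy sequences are hypothetical, motif occurrence is tested with a regular expression on one strand, and candidate sets always contain the current base and never all four bases, mirroring the six candidates listed on the slide.

```python
import math
import re
from itertools import combinations


def mutual_information(table):
    """Mutual information (bits) of a 2x2 table of counts."""
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:
                mi += (n / total) * math.log2(n * total / (row_tot[i] * col_tot[j]))
    return mi


def motif_mi(motif, targets, randoms):
    """MI between motif presence and group; motif is a list of allowed-base strings."""
    pattern = re.compile("".join(f"[{bases}]" for bases in motif))
    t = sum(1 for s in targets if pattern.search(s))
    r = sum(1 for s in randoms if pattern.search(s))
    return mutual_information([[len(randoms) - r, len(targets) - t], [r, t]])


def optimize_position(motif, pos, targets, randoms):
    """Try each wider base set at pos; keep the one with the highest MI."""
    current = set(motif[pos])
    best, best_mi = motif, motif_mi(motif, targets, randoms)
    others = [b for b in "ACGT" if b not in current]
    for n in range(1, len(others)):            # never widen to all four bases
        for extra in combinations(others, n):
            bases = "".join(sorted(current | set(extra)))
            cand = motif[:pos] + [bases] + motif[pos + 1:]
            score = motif_mi(cand, targets, randoms)
            if score > best_mi:
                best, best_mi = cand, score
    return best, best_mi


# Toy data: half of the targets carry ATCCGTACA, the other half ATCCCTACA.
targets = ["GGATCCGTACAGG", "TTATCCCTACATT", "ATCCGTACA", "ATCCCTACA"]
randoms = ["GGGGGGGGGGGGG", "TTTTTTTTTTTTT", "ACACACACACACA", "CCCCCCCCCCCCC"]
seed = list("ATCCGTACA")
best, best_mi = optimize_position(seed, 4, targets, randoms)
print(best, best_mi)
```

In the toy data the seed matches only two of the four targets, and allowing C or G at the fifth position recovers all four, so that change is kept. A full optimizer would repeat this over all positions until no change improves the score.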
[Figure: panels titled "Mutual information", "Motif Conservation with S. bayanus", and "Similarity to ChIP-chip RAP1 motif"; axis label: change]
Highly informative k-mers
k-mer MI
CTCATCG 0.0618
TCATCGC 0.0485
AAAATTT 0.0438
GCTCATC 0.0434
AAAAATT 0.0383
ATGAGCT 0.0334
TTGCCAC 0.0322
TGCCACC 0.0298
ATCTCAT 0.0265
...
Only optimize a k-mer if I(k-mer; expression | motif) is large enough (for all motifs optimized so far)
[Figure: motifs optimized so far (MI=0.081, MI=0.045); next k-mer: optimize?]
Conditional mutual information I(X;Y|Z)
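The conditional mutual information I(X;Y|Z) can be estimated from three parallel label lists. In this sketch X is taken to be the presence of the candidate k-mer, Y the group label of the sequence, and Z the presence of an already optimized motif; the function name, that variable mapping, and the use of base-2 logarithms are assumptions for illustration.

```python
import math
from collections import Counter


def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) in bits from three equal-length sequences of labels.

    Uses I(X;Y|Z) = sum over (x,y,z) of
    P(x,y,z) * log2( P(z) * P(x,y,z) / (P(x,z) * P(y,z)) ).
    """
    n = len(xs)
    c_xyz = Counter(zip(xs, ys, zs))
    c_xz = Counter(zip(xs, zs))
    c_yz = Counter(zip(ys, zs))
    c_z = Counter(zs)
    total = 0.0
    for (x, y, z), n_xyz in c_xyz.items():
        ratio = c_z[z] * n_xyz / (c_xz[(x, z)] * c_yz[(y, z)])
        total += (n_xyz / n) * math.log2(ratio)
    return total


group = [0, 0, 1, 1]          # Y: random (0) or target (1)
kmer = [0, 0, 1, 1]           # X: candidate k-mer present?

# The k-mer duplicates an existing motif: nothing new is learned.
print(conditional_mi(kmer, group, zs=[0, 0, 1, 1]))
# The existing motif is uninformative: the k-mer keeps its full information.
print(conditional_mi(kmer, group, zs=[0, 0, 0, 0]))
```

A k-mer that is redundant with an already optimized motif scores near zero and would be skipped, which avoids rediscovering the same motif several times.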
[Figure legend: Enrichment / Depletion]
Motif co-occurrence analysis
Discovered Motifs
FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR
ChIPseeqer: an integrated framework for ChIP-seq data analysis
• ChIPseeqer (peak detection)
• ChIPseeqer2Track (for Genome Browser)
• ChIPseeqer2FIRE (+ motif analysis)
• ChIPseeqer2iPAGE (+ pathway analysis)
• ChIPseeqer2cons (conservation analysis)
Installing and setting up programs
Install ChIPseeqer and FIRE:
http://physiology.med.cornell.edu/faculty/elemento/lab/chipseq.shtml
http://tavazoielab.princeton.edu/FIRE/
Execute the following commands:
export FIREDIR=/Applications/FIRE-1.1
export PATH=$PATH:$FIREDIR
export CHIPSEEQERDIR=/Applications/ChIPseeqer-1.0
export PATH=$PATH:$CHIPSEEQERDIR:$CHIPSEEQERDIR/SCRIPTS
chmod +x $CHIPSEEQERDIR/ChIP*
chmod +x $CHIPSEEQERDIR/SCRIPTS/*.pl
Peak Detection
- Input file: CTCF.bed
  cd ~/Desktop/elemento
  Or download from: http://physiology.med.cornell.edu/faculty/elemento/lab/files/chipseq/
- 2947043 U0 reads in BED format (check by typing wc -l CTCF.bed; view by typing more CTCF.bed, and q to exit)
- No input DNA for this experiment
Peak Detection
Step 1: Split big read file into one file per chromosome
split_bed_or_mit_files.pl CTCF.bed
Expected output:
Opening CTCF.bed
Current directory = .
Creating ./reads.chr1 …
Peak Detection
Step 2. Detect peaks
ChIPseeqer --chipdir=. --t=15 --fraglen=250 --format=bed -outfile=CTCF_peaks_t15.txt
Expected output:
Processing reads in chrY ... done.
Processing reads in chrX ... done.
Processing reads in chr9 ... done.
Processing reads in chr8 ... done.
Step 3. Count how many peaks were found
wc -l CTCF_peaks_t15.txt
Making a Genome Browser track
Command lines:
cd JuliaChild
wc -l CTCF_peaks_t15.txt
ChIPseeqer2track --targets=CTCF_peaks_t15.txt --trackname="CTCF peaks"
Expected output:
CTCF_peaks_t15.txt.wgl.gz created.
To check that the file was created:
ls
Making a Genome Browser track
http://genome.ucsc.edu/cgi-bin/hgGateway
Making FIRE input files
Command line (type instructions below as one single line):
ChIPseeqer2FIRE --targets=CTCF_peaks_t15.txt --genome=wg.fa --suffix=CTCF_peaks_t15_FIRE
wg.fa is also available from: http://physiology.med.cornell.edu/faculty/elemento/lab/files/chipseq/ (decompress with gunzip wg.fa.gz)
Expected output:
Extracting sequences ... Done.
Extracting randomly selected sequences ... Done.
CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
…
FIRE analysis
Command line (type instructions below as one single line):
fire.pl --expfile=CTCF_peaks_t15_FIRE.txt --fastafile_dna=CTCF_peaks_t15_FIRE.seq --nodups=1 --minr=2 --species=human --dorna=0 --dodnarna=0
Expected output:
Extracting sequences ... Done.
Extracting randomly selected sequences ... Done.
CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
…
FIRE main output file
[Figure: FIRE summary; columns: peak sequences, randomly selected sequences]
open CTCF_peaks_t15_FIRE.txt_FIRE/DNA/CTCF_peaks_t15_FIRE.txt.summary.pdf