Array Informatics Mark Gerstein
description
Transcript of Array Informatics Mark Gerstein
Do not reproduce without permission 1 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
Array Informatics
Mark Gerstein
Do not reproduce without permission 2 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
CEGS Informatics Developing Tools and Technical Analyses Related to
Genome Technologies
• Main Genome Technologies Tiling Arrays Next Generation Sequencing
• Main Applications Transcript mapping Protein-DNA Binding CGH
• Transitioning to Seq....
Do not reproduce without permission 3 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
Tools & Tech. Analyses for Processing of Genome Technology Data
• Normalizing Arrays and Measuring & Correcting Artifacts COP - Correcting positional artifacts [Yu et al. NAR '07] Efficient Pseudomedian Calculation - for Tiling Array Scoring
[Royce et al., BMC Bioinfo. '07] Measuring Mismatch Effects
[Seringhaus et al., BMC Genomics (submitted)] Removing Seq. Effects [Royce et al., Bioinfo. '07] NN Prediction of Probe Intensity - measuring & exploiting specific
cross-hyb [Royce et al. NAR '07]
• Simulating NextGen Sequencing ChipSeqSim - simulating ChIP Seq [Zhang et al., PLoS CB '08]
Do not reproduce without permission 4 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
Tools & Tech. Analyses for Genome Structural Variation
Breakptr - HMM-based Array Segmentation for CNV detection [Korbel et al., PNAS '07]
MSB - Mean-shift-based Array Segmentation for CNV detection with extension to sequencing [Wang et al. Gen. Res. (submitted)]
PEMer - Paired-end Mapping for SV Dectection with simulation calibration and breakpoint DB [Korbel et al., GenomeBiol. (submitted)]
Long-SV-Assembly Simulations [Du et al., Nat. Meth. (submitted)] SD-CNV-CORR - Approach for correlating the occurrence of CNVs
and SDs with genomic features (particularly repeats) [Kim et al., Genome Res. (submitted)]
Do not reproduce without permission 5 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
5
(c
) M
ark
Ge
rste
in,
20
02
, Y
ale
, b
ioin
fo.m
bb
.ya
le.e
du
5
(c
) M
ark
Ge
rste
in,
20
02
, Y
ale
, b
ioin
fo.m
bb
.ya
le.e
du
A Starting Point: Noisy Raw Signal from Tiling Arrays (Transcription)
Joh
nso
n e
t a
l. (2
00
5)
TIG
, 2
1,
93
-10
2.
Li et al., PLOS one (2007)
Do not reproduce without permission 6 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
6
(c
) M
ark
Ge
rste
in,
20
02
, Y
ale
, b
ioin
fo.m
bb
.ya
le.e
du
Perfect Match Specific Cross-hyb. Non-specific Cross-hyb.
Specific & Non-specific Cross-Hyb.• Perfect match (PM):
probe binding intended target• Specific cross-hyb.: probes binding non-PM
targets with a small number of mismatches• Non-specific cross-hyb.: probes binding
targets with many mismatches, due to general stickiness of oligos
Do not reproduce without permission 7 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
Non-Specific Cross Hyb. (Sequence Effects)
Do not reproduce without permission 8 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
8
(c
) M
ark
Ge
rste
in,
20
02
, Y
ale
, b
ioin
fo.m
bb
.ya
le.e
du
Creation of Standardized Datasets for
Quantifying Effect of Mismatches
[Seringhaus et al., BMC Genomics (in press)]
Types of Mispairs (probe on array is first)
No
rmal
ized
In
ten
sity
MM
M
M v
s. P
M
Yeast Human
GA
v A
G
CA
v A
C
GT
v T
G
CT
v T
C
Do not reproduce without permission 9 L
ec
ture
s.G
ers
tein
La
b.o
rg
(c)
20
07
Sourc
e:
Royce
, T.E
., e
t al (2
00
7),
Bio
info
rmati
cs, 23
, 9
88
-97
Avg. intensity of all background probes with a C at position 4
Avg. intensity of all background probes with a T at position 33
Observing Non-specific Cross-hyb. (Probe sequence effects)
Do not reproduce without permission 10
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00710
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Iterated Quantile Normalization to
Correct for Non-specific Cross-hyb.
• Adapt Bolstad et al (2003) approach to tiling arrays
• Force distributions with a given nt at each position to be same
• Distributions at other positions now different so iterate
• Also, robust adaptation of Naef & Magnasco (2003)
T Royce et al (2007), Bioinformatics, 23, 988-97
Do not reproduce without permission 11
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Measuring Specific Cross-Hyb
Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97
Do not reproduce without permission 12
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00712
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Proof of principle test to “exploit” this
Figure from http://www.members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm
• Using Cheng et al. (2005), predict gene expression levels (and profiles across tissues) for genes on part of chr. #6
• ...Based on closest cross-hyb tiles on part of chr. #7
• Then compare to measured levels and profile on #6
Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97
Do not reproduce without permission 13
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Nearest Nbr Search on Virtual Tiling
Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97
Do not reproduce without permission 14
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Agreement between predicted tile expression profile and actual one
Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97
• Correlated predicted profiles with the actual profiles of gene expression across cell lines
• Much more correlation than expected by chance (dist. centered on 0)
Do not reproduce without permission 15
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00715
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Royce, T. E. et al. Nucl. Acids Res. 2007 35:e99
Very Strong ROC Curve: Most genes are accurately detected using
nearest-neighbor features' signals
• Illustrates great magnitude of cross-hyb. on hi-density arrays
• High feature density arrays inadvertently resurrecting generic n-mer concept (van Dam & Quake, 2003)
• Suggests that tiling arrays could be exploited to create universal arrays
• Gold std. set of known expressed genes. How well do we find. • A set of known positives was defined as the Refseq genes with at
least 75% transfrag coverage. A set of known negatives was constructed by permuting the sequences in the set of known positives. For various thresholds, sensitivity and specificity were computed and then plotted.
Do not reproduce without permission 16
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
CEGS Informatics Credits
• Array Corrections J Rozowsky T Royce M Seringhaus
• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero
• Experimental M Snyder S Weissman A Urban
Do not reproduce without permission 17
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Computational Methods for SV Characterization
Mark Gerstein
Do not reproduce without permission 18
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Computational Methods for SV Characterization
Segmenting Array CGH data
Building a PEM pipeline
Correlating SVs and SDs with Repeats
Do not reproduce without permission 19
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00719
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
http://breakptr.gersteinlab.orgKorbel*, Urban* et al., PNAS (2007)
-0.5
0
0.5
Flu
ores
cenc
e
log2
rat
io
NO T TO SCALEACGTGACAC ATAAGCACACCA ATTGCTTGAGGGACCT TAGGCACAGT TAAC ATGATAAGCACACCA ATTGCTTGAGGTGACDNA
sequence
BreakPtr HMM
• To get highest resolution on breakpoints need to smooth & segment the signal
• BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models
Do not reproduce without permission 20
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00720
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
High resolution of tiling arrays allows statistical integration of nucleotide sequence patterns
>4-fold enrichment of the breakpoints of copy number variants near segmental duplications (SDs)[e.g. Sharp et al., Am. J. Hum. Genet. 2005; 77:78-88].
Do not reproduce without permission 21
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00721
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
BreakPtr statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM)
Korbel*, Urban* et al., PNAS (2007)
Transition A
Transition A’
Transition B
Transition B’
Duplication DeletionNormal
Sequence
Arr
ayva
lues
Sequence
Arra
yva
lues
Do not reproduce without permission 22
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00722
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Gold standards
Training data
Parameter optimization
Model parameter estimation
Breakpointvalidation
[sequencing]
log (number of CNVs available for parameter estimation)10
10[core]
304[intermediate A]
1003[intermediate B]
2503[full]
SDs
norm
aliz
ed f
luor
esce
nt
inte
nsity
log
-ra
tios
2
Max
imum
num
ber
of p
aram
eter
s pe
r tr
ansi
tion
stat
e
‘Active’ approach for breakpoint identification: initial scoring with preliminary model, targeted validation (with sequencing),
retraining, and rescoring
HMM optimized iteratively (using Expectation Maximization, EM) Korbel*, Urban* et al., PNAS (2007)
CNV breakpoints sequenced in ~10 cases following BreakPtr analysis;
Median resolution <300 bp
No improvement in accuracy with higher resolution (9nt tiling)
Do not reproduce without permission 23
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00723
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Moving Beyond Arrays, Computational Methods for Next-
Generation Sequencing:Paired End Mapping to Find SVs
Do not reproduce without permission 24
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00724
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Overall Strategy for Analysis of
NextGen Seq. Data to Detect Structural Variants
[Korbel et al., Science ('07); Korbel et al., GenomeBiol. (submitted)]
Do not reproduce without permission 25
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00725
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Simulation strategy
Simulation
Experiment454 sequencing [Korbel et al., GenomeBiol. (submitted)]
Do not reproduce without permission 26
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00726
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Reconstruction efficiency at different coverage
[Korbel et al., GenomeBiol. (submitted)]
Do not reproduce without permission 27
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00727
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Building a Database of Variants: Complexities
[Korbel et al., GenomeBiol. (submitted)]
Do not reproduce without permission 28
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00728
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
u
Analyzing Duplications in the Genome (SDs & CNVs)
pers. photo, see streams.gerstein.info
080907_SD_CNV_Slides_MBG_CEGS_PMK
29
SEGMENTAL DUPLCATIONS AND COPY NUMBER VARIANTS ARE RELATED PHENOMENA AND HAVE BEEN CREATED BY SEVERAL DIFFERENT MECHANISMS
Copy Number Variants (CNV) Segmental Duplications (SD)
Fixation
Intra-species variation Fixed mutations(differences to other species)
NAHR (Non-allelic homologous recombination)
Flanking repeat(e.g. Alu, LINE…)
NHEJ (Non-homologous-end-joining)
No (flanking) repeats. In some cases <4bp microhomologies
080907_SD_CNV_Slides_MBG_CEGS_PMK
30
PERFORM LARGE SCALE CORRELATION ANALYSIS TO DETECT REPEAT SIGNATURES OF SDs AND CNVs
Survey a range of genomic features
Count the number of features in each genomic bin (100kb)
Calculate correlations / enrichments using robust stats
1
2
3
…ATCAAGG CCGGAA…
Exact match
Local environment
If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment
[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]
080907_SD_CNV_Slides_MBG_CEGS_PMK
31
SDs ARE CORRELATED WITH ALUS AND OTHER SDs
0.080.09
0.120.14 0.14 0.13
Alu association with SDs by age
90-92% 92-94% 94-96% 96-98% 98-99% >99%
• The co-localization of Alu elements with SDs is highly significant.
• Older SDs have a much higher association with Alus than younger SDs.
• SDs can mediate NAHR and lead to the formation of CNVs
• Such mechanisms (“preferential attachment”) are well studied in physics and should leads a very skewed (“power-law”) distribution of SDs.
•Hotspots[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]
f
Number of SDs in Genomic Bin
Oc
cu
rre
nc
e
080907_SD_CNV_Slides_MBG_CEGS_PMK
32
0.0480.0006
0.07390.0466
ASSOCIATIONS ARE DIFFERENT FOR SDs AND CNVs
Alu
0.92
CNV association with repeats
Microsatellite
<0.001
Pseudogenes
0.046
LINE
0.001
0.07
0.27
0.0940.21
Alu
<0.001
SD association with repeats
Microsatellite
<0.001
Pseudogenes
0.046
LINE
0.001
[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]
CNVAssociationwith SDs
>99% SDs* CNVs
0.31
0.11
CNVs ARE LESS ASSOCIATED WITH SDs THAN THE GENERAL SD TREND
Do not reproduce without permission 33
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
00733
(
c)
Ma
rk G
ers
tein
, 2
00
2,
Ya
le,
bio
info
.mb
b.y
ale
.ed
uOldLow seq-ID (%)
YoungHigh seq-ID (%)
CNVs
Fixation Aging (~40Mya)
NAHR
NHEJ
SDs
AluSD
LINEMicrosatellite
SubtelomeresFragile sites
Alu Burst (40 MYA)
AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY
• About 40 million years ago there was a burst in retrotransposon activity
• The majority of Alu elements stem from that time
• This, in turn, led to rapid genome rearrangement via NAHR
• The resulting SDs, could create more SDs, but with Alu activity decaying, their creation slowed
[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]
Do not reproduce without permission 34
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
Future Directions
• Simulations of SV Assembly• Analysis of Split Reads• Detailed Analysis of SV and CNVs
with Genomic Features
Do not reproduce without permission 35
Le
ctu
res
.Ge
rste
inL
ab
.org
(c
) 2
007
CEGS Informatics Credits
• Array Corrections J Rozowsky T Royce M Seringhaus
• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero
• Experimental M Snyder S Weissman A Urban