Array Informatics Mark Gerstein

Do not reproduce without permission 1 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Array Informatics

Mark Gerstein


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

CEGS Informatics Developing Tools and Technical Analyses Related to

Genome Technologies

• Main Genome Technologies Tiling Arrays Next Generation Sequencing

• Main Applications Transcript mapping Protein-DNA Binding CGH

• Transitioning to Seq....


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Tools & Tech. Analyses for Processing of Genome Technology Data

• Normalizing Arrays and Measuring & Correcting Artifacts COP - Correcting positional artifacts [Yu et al. NAR '07] Efficient Pseudomedian Calculation - for Tiling Array Scoring

[Royce et al., BMC Bioinfo. '07] Measuring Mismatch Effects

[Seringhaus et al., BMC Genomics (submitted)] Removing Seq. Effects [Royce et al., Bioinfo. '07] NN Prediction of Probe Intensity - measuring & exploiting specific

cross-hyb [Royce et al. NAR '07]

• Simulating NextGen Sequencing ChipSeqSim - simulating ChIP Seq [Zhang et al., PLoS CB '08]


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Tools & Tech. Analyses for Genome Structural Variation

Breakptr - HMM-based Array Segmentation for CNV detection [Korbel et al., PNAS '07]

MSB - Mean-shift-based Array Segmentation for CNV detection with extension to sequencing [Wang et al. Gen. Res. (submitted)]

PEMer - Paired-end Mapping for SV Dectection with simulation calibration and breakpoint DB [Korbel et al., GenomeBiol. (submitted)]

Long-SV-Assembly Simulations [Du et al., Nat. Meth. (submitted)] SD-CNV-CORR - Approach for correlating the occurrence of CNVs

and SDs with genomic features (particularly repeats) [Kim et al., Genome Res. (submitted)]


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

A Starting Point: Noisy Raw Signal from Tiling Arrays (Transcription)

Joh

nso

n e

t a

l. (2

00

5)

TIG

, 2

1,

93

-10

2.

Li et al., PLOS one (2007)


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

6

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Perfect Match Specific Cross-hyb. Non-specific Cross-hyb.

Specific & Non-specific Cross-Hyb.• Perfect match (PM):

probe binding intended target• Specific cross-hyb.: probes binding non-PM

targets with a small number of mismatches• Non-specific cross-hyb.: probes binding

targets with many mismatches, due to general stickiness of oligos


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Non-Specific Cross Hyb. (Sequence Effects)


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

8

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Creation of Standardized Datasets for

Quantifying Effect of Mismatches

[Seringhaus et al., BMC Genomics (in press)]

Types of Mispairs (probe on array is first)

No

rmal

ized

In

ten

sity

MM

M

M v

s. P

M

Yeast Human

GA

v A

G

CA

v A

C

GT

v T

G

CT

v T

C


ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Sourc

e:

Royce

, T.E

., e

t al (2

00

7),

Bio

info

rmati

cs, 23

, 9

88

-97

Avg. intensity of all background probes with a C at position 4

Avg. intensity of all background probes with a T at position 33

Observing Non-specific Cross-hyb. (Probe sequence effects)

Do not reproduce without permission 10

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00710

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Iterated Quantile Normalization to

Correct for Non-specific Cross-hyb.

• Adapt Bolstad et al (2003) approach to tiling arrays

• Force distributions with a given nt at each position to be same

• Distributions at other positions now different so iterate

• Also, robust adaptation of Naef & Magnasco (2003)

T Royce et al (2007), Bioinformatics, 23, 988-97


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Measuring Specific Cross-Hyb

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00712

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Proof of principle test to “exploit” this

Figure from http://www.members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm

• Using Cheng et al. (2005), predict gene expression levels (and profiles across tissues) for genes on part of chr. #6

• ...Based on closest cross-hyb tiles on part of chr. #7

• Then compare to measured levels and profile on #6



Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Nearest Nbr Search on Virtual Tiling

Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Agreement between predicted tile expression profile and actual one


• Correlated predicted profiles with the actual profiles of gene expression across cell lines

• Much more correlation than expected by chance (dist. centered on 0)


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00715

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Royce, T. E. et al. Nucl. Acids Res. 2007 35:e99

Very Strong ROC Curve: Most genes are accurately detected using

nearest-neighbor features' signals

• Illustrates great magnitude of cross-hyb. on hi-density arrays

• High feature density arrays inadvertently resurrecting generic n-mer concept (van Dam & Quake, 2003)

• Suggests that tiling arrays could be exploited to create universal arrays

• Gold std. set of known expressed genes. How well do we find. • A set of known positives was defined as the Refseq genes with at

least 75% transfrag coverage. A set of known negatives was constructed by permuting the sequences in the set of known positives. For various thresholds, sensitivity and specificity were computed and then plotted.


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

CEGS Informatics Credits

• Array Corrections J Rozowsky T Royce M Seringhaus

• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero

• Experimental M Snyder S Weissman A Urban


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Computational Methods for SV Characterization

Mark Gerstein


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Computational Methods for SV Characterization

Segmenting Array CGH data

Building a PEM pipeline

Correlating SVs and SDs with Repeats


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00719

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

http://breakptr.gersteinlab.orgKorbel*, Urban* et al., PNAS (2007)

-0.5

0

0.5

Flu

ores

cenc

e

log2

rat

io

NO T TO SCALEACGTGACAC ATAAGCACACCA ATTGCTTGAGGGACCT TAGGCACAGT TAAC ATGATAAGCACACCA ATTGCTTGAGGTGACDNA

sequence

BreakPtr HMM

• To get highest resolution on breakpoints need to smooth & segment the signal

• BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00720

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

High resolution of tiling arrays allows statistical integration of nucleotide sequence patterns

>4-fold enrichment of the breakpoints of copy number variants near segmental duplications (SDs)[e.g. Sharp et al., Am. J. Hum. Genet. 2005; 77:78-88].


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00721

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

BreakPtr statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM)

Korbel*, Urban* et al., PNAS (2007)

Transition A

Transition A’

Transition B

Transition B’

Duplication DeletionNormal

Sequence

Arr

ayva

lues

Sequence

Arra

yva

lues


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00722

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Gold standards

Training data

Parameter optimization

Model parameter estimation

Breakpointvalidation

[sequencing]

log (number of CNVs available for parameter estimation)10

10[core]

304[intermediate A]

1003[intermediate B]

2503[full]

SDs

norm

aliz

ed f

luor

esce

nt

inte

nsity

log

-ra

tios

2

Max

imum

num

ber

of p

aram

eter

s pe

r tr

ansi

tion

stat

e

‘Active’ approach for breakpoint identification: initial scoring with preliminary model, targeted validation (with sequencing),

retraining, and rescoring

HMM optimized iteratively (using Expectation Maximization, EM) Korbel*, Urban* et al., PNAS (2007)

CNV breakpoints sequenced in ~10 cases following BreakPtr analysis;

Median resolution <300 bp

No improvement in accuracy with higher resolution (9nt tiling)


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00723

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Moving Beyond Arrays, Computational Methods for Next-

Generation Sequencing:Paired End Mapping to Find SVs


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00724

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Overall Strategy for Analysis of

NextGen Seq. Data to Detect Structural Variants

[Korbel et al., Science ('07); Korbel et al., GenomeBiol. (submitted)]


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00725

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Simulation strategy

Simulation

Experiment454 sequencing [Korbel et al., GenomeBiol. (submitted)]


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00726

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Reconstruction efficiency at different coverage

[Korbel et al., GenomeBiol. (submitted)]


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00727

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Building a Database of Variants: Complexities

[Korbel et al., GenomeBiol. (submitted)]


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00728

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Analyzing Duplications in the Genome (SDs & CNVs)

pers. photo, see streams.gerstein.info

080907_SD_CNV_Slides_MBG_CEGS_PMK

29

SEGMENTAL DUPLCATIONS AND COPY NUMBER VARIANTS ARE RELATED PHENOMENA AND HAVE BEEN CREATED BY SEVERAL DIFFERENT MECHANISMS

Copy Number Variants (CNV) Segmental Duplications (SD)

Fixation

Intra-species variation Fixed mutations(differences to other species)

NAHR (Non-allelic homologous recombination)

Flanking repeat(e.g. Alu, LINE…)

NHEJ (Non-homologous-end-joining)

No (flanking) repeats. In some cases <4bp microhomologies


30

PERFORM LARGE SCALE CORRELATION ANALYSIS TO DETECT REPEAT SIGNATURES OF SDs AND CNVs

Survey a range of genomic features

Count the number of features in each genomic bin (100kb)

Calculate correlations / enrichments using robust stats

1

2

3

…ATCAAGG CCGGAA…

Exact match

Local environment

If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]


31

SDs ARE CORRELATED WITH ALUS AND OTHER SDs

0.080.09

0.120.14 0.14 0.13

Alu association with SDs by age

90-92% 92-94% 94-96% 96-98% 98-99% >99%

• The co-localization of Alu elements with SDs is highly significant.

• Older SDs have a much higher association with Alus than younger SDs.

• SDs can mediate NAHR and lead to the formation of CNVs

• Such mechanisms (“preferential attachment”) are well studied in physics and should leads a very skewed (“power-law”) distribution of SDs.

•Hotspots[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

f

Number of SDs in Genomic Bin

Oc

cu

rre

nc

e


32

0.0480.0006

0.07390.0466

ASSOCIATIONS ARE DIFFERENT FOR SDs AND CNVs

Alu

0.92

CNV association with repeats

Microsatellite

<0.001

Pseudogenes

0.046

LINE

0.001

0.07

0.27

0.0940.21

Alu

<0.001

SD association with repeats

Microsatellite

<0.001

Pseudogenes

0.046

LINE

0.001


CNVAssociationwith SDs

>99% SDs* CNVs

0.31

0.11

CNVs ARE LESS ASSOCIATED WITH SDs THAN THE GENERAL SD TREND


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00733

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

uOldLow seq-ID (%)

YoungHigh seq-ID (%)

CNVs

Fixation Aging (~40Mya)

NAHR

NHEJ

SDs

AluSD

LINEMicrosatellite

SubtelomeresFragile sites

Alu Burst (40 MYA)

AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY

• About 40 million years ago there was a burst in retrotransposon activity

• The majority of Alu elements stem from that time

• This, in turn, led to rapid genome rearrangement via NAHR

• The resulting SDs, could create more SDs, but with Alu activity decaying, their creation slowed



Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Future Directions

• Simulations of SV Assembly• Analysis of Split Reads• Detailed Analysis of SV and CNVs

with Genomic Features


Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

CEGS Informatics Credits

• Array Corrections J Rozowsky T Royce M Seringhaus

• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero

• Experimental M Snyder S Weissman A Urban

Array Informatics Mark Gerstein

Documents

Transcript of Array Informatics Mark Gerstein