Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics 2015 @mgymrek


Page 1: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Characterizing the short tandem repeat mutation process at every locus in the genome

Melissa Gymrek, Genome Informatics 2015

@mgymrek

Page 2: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Genetic variation comes in many forms

SNP
ACGACTCGAGCG
ACGACACGAGCG
μSNP: 1.20 × 10^-8 /loc/gen

Short indel (1-20bp)
ACGACTCGAGCG
ACGAC-CGAGCG
μINDEL: 0.68 × 10^-9 /loc/gen

Short tandem repeat
CAGCAG---CAGCAGCA
CAGCAGCAGCAGCAGCA
μSTR: 10^-2 to 10^-5 /loc/gen

Alu retrotransposition

Struct. Var / CNV (>20bp)

# de novo/gen (bar chart, axis 0 to 500):
STR 500
SNP 50
Indel 3
SV 0.2
Alu 0.05

Intro.

STR catalog PSMC Mutation process Conclusion

10/29/15 Melissa GymrekGenome Informatics 2015

Page 3: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

eSTRs contribute to gene expression variability

[Plot: Observed p-value [-log10] vs. Expected p-value under the null [-log10]]

[Schematic labels: Gene, (TG) STR, Expression]


Page 4: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Why study the STR mutation process?

1. Identify rapidly mutating STRs

2. Understand biological processes driving mutation patterns

3. Identify STRs under selective pressure

Haasl and Payseur 2013

H0: Locus evolves under neutral model

H1: Locus is under selection


Page 5: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

STRs and SNPs provide orthogonal molecular clocks

Clock 1: SNPs
ACCCATCCTAGCTACCGACTACAACG
ACCGATCCTAGCTTCCGACTACCACG
# mismatches ~ f(μSNP, t, …)

Clock 2: STRs
ACACTCATCTG(CAG)mACACACTGA
ACACTCATCTG(CAG)nACACACTGA
(m-n)^2 ~ f(μSTR, t, …)

Use known value of μSNP to calibrate the STR molecular clock

μSNP: SNP mutation rate (/loc/gen)
μSTR: STR mutation rate (/loc/gen)
t: Time to the most recent common ancestor (TMRCA)
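
One way to make the two clocks concrete is sketched below. The slide leaves f unspecified, so the forms here are assumptions: SNPs follow an infinite-sites model over a window of w sites (w is a hypothetical window length, not a symbol from the talk), and the STR follows a symmetric single-step model with no length constraint. Both chromosomes have evolved for t generations since the MRCA, hence the factor of 2.

```latex
\begin{align}
\mathbb{E}[\#\text{mismatches}] &= 2\,\mu_{\mathrm{SNP}}\,w\,t \\
\mathbb{E}\big[(m-n)^2\big] &= 2\,\mu_{\mathrm{STR}}\,t \\
\hat{t} &= \frac{\#\text{mismatches}}{2\,\mu_{\mathrm{SNP}}\,w},
\qquad
\hat{\mu}_{\mathrm{STR}} = \frac{(m-n)^2}{2\,\hat{t}}
\end{align}
```

Under these assumptions the known μSNP fixes t, and t in turn calibrates μSTR. A single genotype gives a very noisy estimate, so any practical use would pool over many individuals.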


Page 6: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Estimating STR mutation parameters from WGS

Input: 300 high coverage SGDP whole genomes

Per diploid locus (CAGm, CAGn):
• STR calls
• SNPs → PSMC (Li and Durbin 2011) → TMRCA

Infer locus specific mutation params.

Learn model to predict mutation parameters from sequence features

[Diagram elements: L, k; plot of (m-n)^2 vs. TMRCA; plot of Frequency vs. Step size]


Page 7: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

We are now armed with deep WGS amenable to STR profiling

SGDP: 300 deeply sequenced, PCR free genomes with diverse origins


Page 8: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Generating high quality STR genotypes

Per sample (Sample 1, Sample 2, …, Sample n): FASTQ → Alignment (BWA-MEM) → BAM

lobSTR: Allelotype (multi-sample) → VCF → Filtering

Page 9: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

High coverage genomes provide accurate STR genotypes

Homopolymers (e.g. AAAAAA) (n=50,398)

R^2=0.92

93% concordance with capillary data

Accurately recover population structure

http://strcat.teamerlich.org/

Page 10: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Estimating STR mutation parameters from WGS

Input: 300 high coverage SGDP samples

Per diploid locus (CAGm, CAGn):
• STR calls
• SNPs → PSMC (Li and Durbin 2011) → TMRCA

Infer locus specific mutation params.

Learn model to predict mutation parameters from sequence features

[Diagram elements: L, k; plot of (m-n)^2 vs. TMRCA; plot of Frequency vs. Step size]


Page 11: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Measuring TMRCA using PSMC

[Figure axis: Discretized TMRCA (Li and Durbin, Nature 2011)]

Maternal chromosome: CAGm

Paternal chromosome: CAGn

Measure local TMRCA


Page 12: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Estimating STR mutation parameters from WGS

Input: 300 high coverage SGDP samples

Per diploid locus (CAGm, CAGn):
• STR calls
• SNPs → PSMC (Li and Durbin 2011) → TMRCA

Infer locus specific mutation params.

Learn model to predict mutation parameters from sequence features

[Diagram elements: L, k; plot of (m-n)^2 vs. TMRCA; plot of Frequency vs. Step size]


Page 13: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

What we know about STR mutations

1. Mutate in “unit” lengths

CAGCAGCAGCAGCAGCAGCAGCAG
CAGCAGCAG---CAGCAGCAGCAG
CAGCAG------CAGCAGCAGCAG
CAGCAG---CAGCAGCAGCAGCAG
CAGCAGCA-CAGCAGCAGCAGCAG

2. Step size distribution ~Geometric

P: probability of mutating a single step

3. Length constraint biases mutation direction

Short alleles tend to get longer, long alleles shorter (Sun et al. 2012)

4. Other important factors not modeled here

• Length-dependent mutation rate

• Motif sequence interruptions

• Large expansions behave differently (e.g. Huntington’s)

• Biased gene conversion?

• Interaction between alleles?


Page 14: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Modeling STR mutation as a mean-centered random walk

Simple Stepwise Model (SMM): mutate by +/- 1 copy of the repeat unit with probability μ

[Diagram: CAG_MRCA evolves over time t into alleles CAGm and CAGn]

Observed (Sun et al. 2012)

Mean-centered random walk (Ornstein-Uhlenbeck):

μSTR: Mutation rate (per generation)

β: Length constraint (0 ≤ β ≤ 1)
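
The contrast between the two models can be sketched with a small simulation. The slide's formula for the mean-centered walk did not survive in this transcript, so the direction rule below is an assumption made for illustration: a mutation is an expansion with probability (1 - β·x)/2, clipped to [0, 1], where x is the current offset from the mean allele in repeat units. Setting β = 0 gives the symmetric stepwise walk. The function names are hypothetical.

```python
import random


def simulate_allele(generations, mu, beta, seed=None):
    """Simulate one lineage's allele offset (in repeat units) from the mean allele.

    Assumed rule (not taken from the slides): each generation the allele
    mutates with probability mu; the step is +1 with probability
    (1 - beta * x) / 2, clipped to [0, 1], and -1 otherwise, where x is the
    current offset. beta = 0 gives a symmetric stepwise walk.
    """
    rng = random.Random(seed)
    x = 0
    for _ in range(generations):
        if rng.random() < mu:
            p_up = min(1.0, max(0.0, (1.0 - beta * x) / 2.0))
            x += 1 if rng.random() < p_up else -1
    return x


def offset_variance(n_lineages, generations, mu, beta, seed=0):
    """Empirical variance of the final offset across independent lineages."""
    rng = random.Random(seed)
    finals = [
        simulate_allele(generations, mu, beta, rng.randrange(2**32))
        for _ in range(n_lineages)
    ]
    mean = sum(finals) / n_lineages
    return sum((v - mean) ** 2 for v in finals) / n_lineages


# Without constraint the variance grows roughly as mu * generations;
# with constraint the walk is pulled back toward the mean allele.
var_free = offset_variance(1000, 1000, 0.02, 0.0)
var_constrained = offset_variance(1000, 1000, 0.02, 0.5)
```

Comparing `var_free` with `var_constrained` shows why a length constraint matters for the STR clock: with β > 0 the spread of allele lengths saturates instead of growing linearly with time.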


Page 15: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Estimating the step size distribution

[Plots: Frequency vs. Step size (# units), step sizes from -4 to +4; allele position axis: -5, 0 (Mean allele length), +5]

p: Probability that the step size is a single unit.

Tetranucleotides: p = ~0.95

Dinucleotides: p = ~0.7
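
As a quick illustration of what these values of p imply, the sketch below tabulates a geometric distribution over the step magnitude, with P(|step| = 1) = p. Treating the magnitude this way follows the "Step size distribution ~Geometric" point from the talk; the function name and the cutoff at 4 units are choices made here.

```python
def step_size_pmf(p, max_step):
    """P(|step| = k units) for k = 1..max_step under a geometric
    distribution with parameter p, so that P(|step| = 1) = p."""
    return [p * (1.0 - p) ** (k - 1) for k in range(1, max_step + 1)]


# Approximate values of p quoted on the slide.
for label, p in [("Tetranucleotides", 0.95), ("Dinucleotides", 0.7)]:
    pmf = step_size_pmf(p, 4)
    print(label, [round(v, 4) for v in pmf])
```

With p = ~0.7, about 30% of dinucleotide mutations change the allele by more than one unit, compared with about 5% for tetranucleotides.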


Page 16: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Model validation using Y-STRs

Thomas Willems

Find maximum likelihood mutation parameters (1000 Genomes Project):

P(STR data | Y phylogeny, μ, β, σ)

Validation set: Ballantyne et al. (~2,000 father-son pairs)

[Plot: lobSTR vs. Ballantyne et al.; r=0.831, N=64]

Page 17: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Estimating mutation parameters at autosomal loci

[Plot: ASD, i.e. (m-n)^2 (ticks 0, 4, 9, 16), vs. TMRCA]

Individual 1: CAG5, CAG5

Individual 2: CAG5, CAG8

Individual k: CAGm, CAGn
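
The per-individual quantities on this slide can be computed as in the sketch below. The rate estimator is a deliberate simplification and not the method from the talk: it assumes E[ASD] = 2·μ·t (symmetric single-step mutations, no length constraint) and fits a least-squares line through the origin. The TMRCA values are hypothetical placeholders; in the talk they come from PSMC.

```python
def allele_squared_difference(m, n):
    """ASD for one diploid genotype: squared difference in repeat counts."""
    return (m - n) ** 2


def estimate_mu_smm(asd_values, tmrca_values):
    """Least-squares slope through the origin of ASD on TMRCA, halved.

    Assumes E[ASD] = 2 * mu * t, a simplification that ignores the
    length constraint and multi-unit steps discussed in the talk.
    """
    num = sum(a * t for a, t in zip(asd_values, tmrca_values))
    den = sum(t * t for t in tmrca_values)
    return num / den / 2.0


# Genotypes from the slide: Individual 1 is CAG5/CAG5, Individual 2 is CAG5/CAG8.
genotypes = [(5, 5), (5, 8)]
asd = [allele_squared_difference(m, n) for m, n in genotypes]  # [0, 9]

# Hypothetical TMRCA values in generations, for illustration only.
tmrca = [1000.0, 20000.0]
mu_hat = estimate_mu_smm(asd, tmrca)
```

Each individual contributes one (TMRCA, ASD) point per locus, so a per-locus estimate pools the points across all individuals.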


Page 18: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Per-locus estimation of STR mutation parameters

Estimates for 120K multi-allelic STRs


Page 19: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

STR mutation trends by motif length

Page 20: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Future directions: a genome-wide scan for STR selection

Features:

• Motif length

• Total length

• Recomb. rate

• GC content

Linear model: predict μ, β

Explains:

• 46% of variation in μ

• 4.6% of variation in β

Develop genome-wide STR selection scan (Expected vs. Observed)
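
"Variation explained" is read here as the coefficient of determination (R^2) of the model's predictions; that reading is an assumption, since the slide does not name the statistic. A minimal sketch of the computation, with a hypothetical function name:

```python
def r_squared(observed, predicted):
    """Fraction of the variance in `observed` accounted for by `predicted`:
    1 - (residual sum of squares) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot
```

Under this reading, a value of 0.46 for μ would mean the sequence features account for a little under half of the between-locus variance in the estimated mutation rate, while 0.046 for β leaves most of the variation in length constraint unexplained.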


Page 21: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Conclusion

The first genome-wide characterization of STR mutation

1. STR mutation model

2. Validation against published de novo mutation rates

3. Strong effect of local sequence features

4. Future work: improve estimation, genome-wide selection scan

An unexplored, important source of genetic variation


Page 22: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics

Acknowledgements

Yaniv Erlich, David Reich, Mark Daly, Nick Patterson, Swapan Mallick, Thomas Willems, Alon Goren