Cohen Eukaryotic Comparative Genomics
community.gep.wustl.edu/.../Cohen...Genomics.pdf · Barak...

Eukaryotic Comparative Genomics

Barak Cohen

June 2018 GEP Alumni Workshop

Last Update: 12/23/2020

1

Detecting Conserved Sequences

Charles Darwin
Motoo Kimura

2

Evolution of Neutral DNA

ATGCCCGTTGGAATTTTTTGGGTAA
ATGCCCGTTGGAATTTTTTGGGTAA
*************************

[Figure: substitutions drawn at individual positions of the two copies: A, G, A, G, A, C, A, T, T, T, A, G, A, A, G]

3

Evolution of Non-Neutral DNA

ATTTTTGGCGACCCCAAAAAGGCTTAACC
ATTTTTGGCGACCCCAAAAAGGCTTAACC
*****************************

[Figure: substitutions drawn at individual positions of the two copies: G, C, C, C, G, G, T, G, A, T, T, G, T]

4

Multi-Species Alignment

ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA
ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA
GTACATCGATAGCTTAGAATGCTGGACGATCTC
GTACGTCGATAGCATAGAATGCTGGACGATCTC
 *    *     *   *   ***********
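The row of asterisks under a multi-species alignment marks the columns where every species carries the same base. A minimal Python sketch of that bookkeeping follows; the function name `conservation_line` is a hypothetical helper, not something from the slides, and gap columns are treated as not conserved by assumption.

```python
def conservation_line(rows):
    """Return a string with '*' under every column where all rows agree."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        # A column counts as conserved only if it holds one base and no gap.
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)


# The four aligned sequences from the slide.
alignment = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]
marks = conservation_line(alignment)
for row in alignment:
    print(row)
print(marks)
```

On these four sequences the sketch marks four isolated columns and one run of eleven, the same pattern of asterisks as on the slide.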

5

How to do Comparative Genomics

1. Choose species to analyze
2. Align sequences
3. Identify stretches of highly conserved nucleotides
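Once species are chosen and sequences aligned, the third step can be sketched as a search for runs of fully conserved columns. This is only an illustration: `conserved_stretches` and its `min_len` threshold are hypothetical, and real tools use statistical criteria rather than a fixed run length.

```python
from itertools import groupby


def conserved_stretches(rows, min_len=5):
    """Yield (start, end, sequence) for runs of identical, gap-free columns.

    start is 0-based and end is exclusive. min_len is an illustrative
    threshold, not a value taken from the slides.
    """
    flags = [len(set(col)) == 1 and col[0] != "-" for col in zip(*rows)]
    pos = 0
    for is_conserved, run in groupby(flags):
        length = len(list(run))
        if is_conserved and length >= min_len:
            yield pos, pos + length, rows[0][pos:pos + length]
        pos += length


# Same four-species alignment as in the Multi-Species Alignment slide.
rows = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]
hits = list(conserved_stretches(rows))
print(hits)  # [(20, 31, 'GCTGGACGATC')]
```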

6

Choose species

[Figure labels: closely related species; distantly related species]

• Closely Related Species
  – align well
  – not many changes

• Distantly Related Species
  – hard to align
  – lots of changes

7

[Figure: phylogeny of yeast species with approximate divergence times]

S. cerevisiae
S. paradoxus
S. bayanus
S. pastorianus
S. servazzii
S. unisporus
S. exiguus
S. dairenensis
S. castellii
S. kluyveri
Kluyveromyces lactis
Schizosaccharomyces pombe
S. cariocanus
S. mikatae
S. kudriavzevii

Divergence-time labels: ~10 Mya, ~20 Mya, ~150 Mya, >350 Mya

8

Case Study: Coding vs. Non-Coding

• Coding DNA
  - codes for protein
  - triplet code
  - open reading frame (ORF)
  - tend to be long (50-500 bp)
  - highly constrained

• Non-Coding DNA
  - regulatory functions
  - short (5-15 bp)
  - degenerate
  - variable spacing

ORF: ATG… …TAA
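The triplet code and the ATG-to-stop structure of an ORF can be made concrete with a short scan. The sketch below is a simplification under stated assumptions: it reads only the given strand, `find_orfs` and `min_codons` are hypothetical names, and the toy sequence is invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, orf) for ATG...stop runs in the three forward frames.

    Coordinates are 0-based with an exclusive end; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


toy = "CCATGGCTTGGAAATAAGG"  # hypothetical sequence, not from the slides
print(find_orfs(toy))  # [(2, 17, 'ATGGCTTGGAAATAA')]
```

Because a single inserted or deleted base shifts the reading frame, coding regions tolerate far fewer changes than the short, degenerate regulatory sites described for non-coding DNA.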

9

CASE 1: Non-Coding

GAL4: ATG… …TAA

10

[Figure: the yeast phylogeny shown earlier, repeated]

11

paradoxus  TCTTCTGAGACAGCATCACTTCTTCTTNTTTTTTACATAACTTATTCTTCTATAATTTTC
cerevisiae TCCTTTGAGACAGCATTCGCCCAGTATTTTTTTTATTCTACA-AACCTTCTATAATTT-C
           ** * ***********     *    * *******    **  *  ************ *

paradoxus  AACGTATTTACATAGTTCTGTATCAGTTTAATCACCATAATATTGTTTTCCCTCAACTAA
cerevisiae AAAGTATTTACATAATTCTGTATCAGTTTAATCACCATAATATCGTTTTCT-----TTGT
           ** *********** **************************** ******       *

paradoxus  TGAATGCAATTAGATTTTCTTATTGTTCCCTCGCGGCTTTTTTTTGTTTTATAATCTATT
cerevisiae TTAGTGCAATTAATTTTTCCTATTGTTACTTCG-GGCCTTTTTCTGTTTTATGAGCTATT
           * * ********  ***** ******* * *** *** ***** ******** * *****

paradoxus  TTTTCCGTCATTTCTTCCCCAGATTTCCAACTTCATCTCCAGATTGTGTCTATGTAATGC
cerevisiae TTTTCCGTCATC-CTTCCCCAGATTTTCAGCTTCATCTCCAGATTGTGTCTACGTAATGC
           ***********  ************* ** ********************** *******

paradoxus  ATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGCTACTGTCT
cerevisiae ACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGCTACTGTCT
           * ** ***** ** *** * ** ****** *** ********** ***************

Closely-related sequences are uninformative

GAL4: ATG…

12

[Figure: the yeast phylogeny shown earlier, repeated]

13

Distantly-related sequences do not align

cerevisiae ACTTACCAT-CAAC-CATAGATGGGTAAAC---GGTTAGTAACTAGGAACACGAT
castellii  AGA-GTCAAACTTTTCGT—ATA--TATATATAATATGTCTGATTGCTGGTT---T

* ** * * * * * * * * *

Noncoding (Promoter)

GAL4: ATG…

14

[Figure: the yeast phylogeny shown earlier, repeated]

15

Multiple sequence alignments reveal conserved elements

cerevisiae   TGAGACAGCAT-CACTTCTT-CTTNTTTTTTACATAACTTATTCTTCTATAATTTTCAAC
mikatae      TGAGACAGCATTCACTTCTTTCTTTTTTTTTACATATCTTATTCTTCTATAATTTTCAAC
bayanus      TGAGACAGCATTCGCCCAGT--ATTTTTTTTAT-TCTACAAACCTTCTATAATTT-CAAA
kudriavzevii TGAGACTGCACTCCC--------TCTTCCTTTC------------TCCATAACTT---AC
             ****** ***  * *        * **  **              ** **** **   *

paradoxus    GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
kluyveri     GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
cerevisiae   GTATTTACATAATTCTGTATCAGTTTAATCACCATAAT------ATCGTTTTCTTTGT--
bayanus      TTATTTACATAGTTTTGTATCAGTTTAATCACCATAATCGTAACACCGTTTTACCTCACC
              ********** ** ***********************      *  *****   *

paradoxus    TAATGAATGCAATTAGATTTTC-TTATTGTTCCC-TCGCGGCTTTTTTTTGTTTTATAAT
kluyveri     TAATGAATGCAATTAGATTTTCCTTATTGTTCCCCTCGCGGCTTTTTTTTGTTTTATAAT
cerevisiae   ---TTAGTGCAATTAATTTTTC-CTATTGTTACT-TCG-GGCCTTTTTCTGTTTTATGAG
bayanus      TGATGCGGG--A---ATCCTTC-AGACCGTTCTC-TCGCGC-------------------
                *    *  *       ***   *  ***    *** *

paradoxus    -CTATTTTTTCCGTCATTTCTTCCCC-AGATTTCCAACTTCAT-CTCCAGATTGTGTCTA
kluyveri     ACTATTTTTTCCGTCATTTCTTCCCCCAGATTTCCAACTTCATACTCCAGATTGTGTCTA
cerevisiae   -CTATTTTTTCCGTCATC-CTTCCCC-AGATTTTCAGCTTCAT-CTCCAGATTGTGTCTA
bayanus      -CTTTTTTTTTCGTCATTTCTTCCCC-AGATCTACAACTTTAA-CTCCAGACGGTGTATA
              ** ****** ******  ******* **** * ** *** *  *******  **** **

paradoxus    TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
kluyveri     TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
cerevisiae   CGTAATGCACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGC
bayanus      GGCAGTACAAGCAGTGCTTTTGGGAAGAGGCAAAGCTGCAGACCTCGAGAACAATGAAGC
              * * * ** **  *  * **  **   *  * **   **  ****  ***  *******

UAS1 UAS2

UES MIG1 MIG1

GAL4: ATG…

16

CASE 2: Coding

CLN3: ATG… …TAA

17

[Figure: the yeast phylogeny shown earlier, repeated]

18

Closely-related sequences are uninformative

19

[Figure: the yeast phylogeny shown earlier, repeated]

20

Less distantly related species are not informative either

21

[Figure: the yeast phylogeny shown earlier, repeated]

22

Distantly-related species reveal functional protein domains

23

Identification of Multi-Species Conserved Regions (MCS)

Margulies et al. (2003) Genome Res. 13:2507-18

Human cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Chimp cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Mouse ttcagtcgtttcccagtgtctctga-cattcagagactactttagtaagcattt-tctct
Rat   tcagtccttccctggcatctccag-cactcaa-gactactttagtaagcattt-tctctg
Dog   tcaatgactttcccagtctcttctactgggaagagattaggttgcaaatcatttttctct

* * * * * * **

How can we decide if this region is “conserved”?

24

It's like flipping coins (really)

25

Binomial-Based Method for Detecting Conserved Sequences

p = probability that a site is the same between human and mouse by chance alone (Kimura), q = 1-p

For an alignment N base pairs long with n identities calculate the cumulative binomial probability as:
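With p, q, N and n as defined above, a cumulative binomial probability has the standard form below. Summing over the upper tail (n or more identities) is an assumption made here, since the equation itself does not appear in this text; X denotes the number of identical sites expected by chance.

```latex
P(X \ge n) \;=\; \sum_{k=n}^{N} \binom{N}{k}\, p^{k}\, q^{N-k}
```

A small value means that n or more identities in N sites would rarely arise from neutral evolution alone.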

Margulies et al. (2003) Genome Res. 13:2507-18

H:  AATGG
Mouse:  AATCG
Status: CCCDC
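The coin-flip analogy can be tried on the five-base example above. In the sketch below each site is scored C when the two bases match and D otherwise, and the upper-tail binomial probability of seeing at least that many matches is computed. The helper names and the value `p_neutral = 0.5` are hypothetical placeholders; the slides do not give a numerical value for p.

```python
from math import comb


def status_string(seq_a, seq_b):
    """Mark each aligned site 'C' if the two bases match, else 'D'."""
    return "".join("C" if a == b else "D" for a, b in zip(seq_a, seq_b))


def binomial_tail(N, n, p):
    """Probability of n or more identities in N sites, each matching with probability p."""
    q = 1.0 - p
    return sum(comb(N, k) * p**k * q**(N - k) for k in range(n, N + 1))


human, mouse = "AATGG", "AATCG"
status = status_string(human, mouse)
N, n = len(status), status.count("C")

p_neutral = 0.5  # hypothetical chance of a match under neutrality; an assumption
print(status, N, n, binomial_tail(N, n, p_neutral))
```

With only five sites the tail probability stays large for any plausible p, which is one reason longer windows are needed before a region can be called conserved.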

26

27 28

Large sequencing projects are underway

29

[Figure: two trees relating species A, B, C, D, E and F, one labeled "Star Phylogeny" and the other "Actual Phylogeny"]

Tree Topology Influences Power

30

Challenges in larger genomes

1) Deciding on the neutral rate of substitution

2) Local differences in neutral rate of substitutions

3) Multiple hypothesis testing

4) Repeat sequences and uneven base composition

31

OLIG2

100 kb upstream of OLIG2

PhastCons and the UCSC Genome Browser

32

[Figure: a series of multiple alignments, one per gene (Gene 1, Gene 2, Gene 3 through Gene N), each containing Species 1, Species 2 and Species 3]

Motif Searching Across Several Multiple Alignments

33

Information Content

EcoRI    Random   Rap1
GAATTC   GCCTAC   TGTATGGGTG
GAATTC   ACATTC   TGTTCGGATT
GAATTC   TCATTC   TGCATGGGTG
GAATTC   CGACTC   TGTACAGGTG
GAATTC   GAATTC   TGTATGGATG
GAATTC   ATATCG   TGTTCGGGTT
GAATTC   GAAATG   TGTATGGGTG
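Information content can be computed for the three site collections above. The sketch assumes a uniform background (2 bits maximum per position) and applies no small-sample correction; both are simplifying assumptions, and `information_content` is a hypothetical helper name.

```python
from collections import Counter
from math import log2


def information_content(sites):
    """Total information (bits) of aligned DNA sites: sum of 2 - entropy per column."""
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / len(column)) * log2(c / len(column))
                       for c in counts.values())
        total += 2.0 - entropy  # uniform background assumed
    return total


ecori = ["GAATTC"] * 7
random_sites = ["GCCTAC", "ACATTC", "TCATTC", "CGACTC",
                "GAATTC", "ATATCG", "GAAATG"]
rap1 = ["TGTATGGGTG", "TGTTCGGATT", "TGCATGGGTG", "TGTACAGGTG",
        "TGTATGGATG", "TGTTCGGGTT", "TGTATGGGTG"]

for name, sites in [("EcoRI", ecori), ("Random", random_sites), ("Rap1", rap1)]:
    print(name, round(information_content(sites), 2))
```

The invariant EcoRI site reaches the maximum of 2 bits at each of its six positions, while the Rap1 sites fall short of their maximum because several positions vary. With only seven sequences, even the random set scores above zero, which is why small-sample corrections are used in practice.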

34

Weight Matrix Model of TATA Box

A: -8 10 -1 2 1 -8

C: -10 -9 -3 -2 -1 -12

G: -7 -9 -1 -1 -4 -9

T: 10 -6 9 0 -1 11

G. Stormo

35

A: -8 10 -1 2 1 -8

C: -10 -9 -3 -2 -1 -12

G: -7 -9 -1 -1 -4 -9

T: 10 -6 9 0 -1 11

….A C T A T A A T G T …

Score = -24

G. Stormo

Weight Matrix Model of TATA Box

36

A: -8 10 -1 2 1 -8

C: -10 -9 -3 -2 -1 -12

G: -7 -9 -1 -1 -4 -9

T: 10 -6 9 0 -1 11

….A C T A T A A T G T …

Score = 43
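The two scores on these slides (-24 and 43) can be reproduced by sliding the six-column matrix along the sequence and summing one entry per position. The sketch below copies the matrix from the slide; the function names `score_window` and `scan` are hypothetical.

```python
# Weight matrix copied from the slide: one row per base, one column per position.
TATA_MATRIX = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}


def score_window(window, matrix):
    """Sum the matrix entry for the base observed at each position."""
    return sum(matrix[base][i] for i, base in enumerate(window))


def scan(sequence, matrix):
    """Score every window of the matrix width along the sequence."""
    width = len(next(iter(matrix.values())))
    return [
        (i, sequence[i:i + width], score_window(sequence[i:i + width], matrix))
        for i in range(len(sequence) - width + 1)
    ]


results = scan("ACTATAATGT", TATA_MATRIX)
for offset, window, score in results:
    print(offset, window, score)
```

The window CTATAA scores -24 and the next window, TATAAT, scores 43, so shifting the matrix by a single base separates a poor match from the best match in this sequence.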

G. Stormo

Weight Matrix Model of TATA Box

37

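N(b,i) = number of aligned sites with base b at position i

F(b,i) = frequency of base b at position i, that is, N(b,i) divided by the number of sites

S(b,i) = log[F(b,i)/P(b)], where P(b) is the background frequency of base b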
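The three quantities can be chained into a small routine that turns aligned sites into a weight matrix. This is a sketch under assumptions the slide does not state: a pseudocount of 1 is added so that unseen bases do not produce log(0), the logarithm is base 2, the background is uniform unless supplied, and the example sites are hypothetical.

```python
from math import log2

BASES = "ACGT"


def weight_matrix(sites, background=None, pseudocount=1.0):
    """Build S(b,i) = log2[F(b,i) / P(b)] from aligned sites of equal length."""
    background = background or {b: 0.25 for b in BASES}
    width = len(sites[0])
    matrix = {b: [] for b in BASES}
    for i in range(width):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            n_bi = column.count(b)                         # N(b,i)
            f_bi = (n_bi + pseudocount) / total            # F(b,i), smoothed
            matrix[b].append(log2(f_bi / background[b]))   # S(b,i)
    return matrix


# Hypothetical TATA-like sites, for illustration only.
sites = ["TATAAT", "TATAAA", "TATATA", "TATAAT"]
matrix = weight_matrix(sites)
for base in BASES:
    print(base, [round(s, 2) for s in matrix[base]])
```

Bases seen more often than the background expects get positive scores and rarer bases get negative ones, which matches the sign pattern of the TATA box matrix above.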

G. Stormo

Weight Matrix Model of TATA Box

38

Now we can compare motifs to each other

A: 4 -3 5 -6 -2 -5
C: 2 -1 -2 11 -1 -1
G: -10 8 2 -4 2 -3
T: -3 2 1 2 -3 15

A: 3 -2 2 1 3 1
C: 3 -1 -2 7 -2 -1
G: -8 6 3 -2 2 -2
T: -1 1 1 4 -3 9
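The slide does not say which similarity measure is used to compare motifs. One common choice, shown here purely as an assumption, is the Pearson correlation between corresponding columns, averaged over the motif. The sketch treats the two matrices as already aligned and of equal width, and reads their rows as A, C, G, T in the order given on the slide.

```python
from math import sqrt

BASES = "ACGT"

MOTIF_1 = {
    "A": [4, -3, 5, -6, -2, -5],
    "C": [2, -1, -2, 11, -1, -1],
    "G": [-10, 8, 2, -4, 2, -3],
    "T": [-3, 2, 1, 2, -3, 15],
}
MOTIF_2 = {
    "A": [3, -2, 2, 1, 3, 1],
    "C": [3, -1, -2, 7, -2, -1],
    "G": [-8, 6, 3, -2, 2, -2],
    "T": [-1, 1, 1, 4, -3, 9],
}


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if one is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def columns(matrix):
    """Return the matrix as a list of (A, C, G, T) score tuples, one per position."""
    return list(zip(*(matrix[b] for b in BASES)))


def motif_similarity(m1, m2):
    """Average column-wise Pearson correlation between two aligned motifs."""
    pairs = list(zip(columns(m1), columns(m2)))
    return sum(pearson(c1, c2) for c1, c2 in pairs) / len(pairs)


print(round(motif_similarity(MOTIF_1, MOTIF_2), 3))
```

A value near 1 indicates that the two matrices favor the same bases at the same positions. Comparing motifs that are offset or of different widths would require trying several alignments, which this sketch does not attempt.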

39

MAGMA: unaligned motif finding in multispecies conserved regions

[Figure: conserved regions for Gene 1, Gene 2, Gene 3 through Gene N, each from Species 1, Species 2 and Species 3]

*Ihuegbu, Stormo, & Buhler, JCB 19:139, 2012

40