finding genes by comparing genomes
![Page 1: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/1.jpg)
finding genes by comparing genomes
roderic guigó i serra
imim/upf/crg, barcelona
![Page 2: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/2.jpg)
number of genes on chromosome 22
• initial annotation 545 Dunham et al., 1999
• genscan+RT-PCR 590 Das et al., 2001
• genscan+microarrays 730 Shoemaker et al., 2001
• reviewed annotation 726 chr22 team, sanger, 2001
• mouse shotgun data +20 (our data)
• geneid predictions 794
• genscan predictions 1128
![Page 3: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/3.jpg)
number of genes in the human genome
• Consortium 30,000-40,000 2001
• Celera 27,000-38,000 2001
• Consortium+Celera 50,000 Hogenesch et al., 2001
• DBsearches 65,000-75,000 Wright et al., 2001
• HumanGenomeSciences 90,000-120,000 Haseltine, 2001
![Page 4: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/4.jpg)
sequence conservation and coding function
![Page 5: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/5.jpg)
sequence conservation and coding function
![Page 6: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/6.jpg)
• rosetta (Batzoglou et al., 2000)
• cem (Bafna and Huson, 2000)
• sgp1 (Wiehe et al., 2001)
• twinscan (Korf et al., 2001)
• slam (Pachter et al., 2001)
• doublescan (Meyer and Durbin, 2002)
• sgp2 (Parra et al., 2003)
comparative gene prediction
![Page 7: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/7.jpg)
comparative gene prediction
1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT
given two homologous genomic sequences, infer the exonic structure in each sequence maximizing the score of the alignment of the resulting amino acid sequences.
This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.
• blayo et al., 2002
• pedersen and scharl, 2002
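The "classical dynamic programming algorithm for sequence alignment" that these methods extend can be sketched as follows. This is only the plain global alignment recurrence (Needleman-Wunsch style), not the spliced extension used by the cited programs; the function name and the match, mismatch and gap values are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Hypothetical scoring parameters; linear gap penalty.
    Only two rows of the dynamic programming matrix are kept.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

A comparative gene finder of this first kind would, roughly speaking, add states for introns and splice sites to this recurrence so that the aligned strings are the translated exons rather than the raw genomic sequences.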
![Page 8: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/8.jpg)
comparative gene prediction
2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY
given two homologous genomic sequences, Pair hidden Markov Models (pair HMMs) for sequence alignment and Generalized HMMs (GHMMs) for gene prediction are combined into the so-called Generalized Pair HMMs
• progen – novichkov et al., 2001
• slam – pachter et al., 2001
• doublescan – meyer and durbin, 2002
![Page 9: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/9.jpg)
comparative gene prediction
3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT
first, the alignment is obtained between two homologous genomic sequences using some generic sequence alignment program, such as tblastx, sim4 or glass
then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
• rosetta – batzoglou et al., 2000
• cem – bafna and huson, 2000
• sgp-1 – wiehe et al., 2001
![Page 10: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/10.jpg)
comparative gene prediction
4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT
This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.
• twinscan – korf et al., 2001
• sgp-2 – parra et al., 2003
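The idea of using similarity hits to modify ab initio exon scores can be illustrated with a small sketch. This is not the actual scoring function of twinscan or sgp-2: the `Exon` and `HSP` records, the `weight` parameter, and the rule "add a share of each HSP score proportional to the overlap" are all assumptions made for illustration.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int
    score: float  # ab initio score (hypothetical units)


class HSP(NamedTuple):
    start: int
    end: int
    score: float  # similarity score of the hit projected onto the query


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def rescore_exons(exons, hsps, weight=0.5):
    """Raise each exon score by the HSP score it overlaps (assumed rule)."""
    rescored = []
    for exon in exons:
        bonus = 0.0
        for hsp in hsps:
            shared = overlap(exon.start, exon.end, hsp.start, hsp.end)
            hsp_len = hsp.end - hsp.start + 1
            bonus += hsp.score * shared / hsp_len
        rescored.append(exon._replace(score=exon.score + weight * bonus))
    return rescored


if __name__ == "__main__":
    exons = [Exon(100, 199, 5.0), Exon(500, 599, 5.0)]
    hsps = [HSP(150, 249, 40.0)]
    for exon in rescore_exons(exons, hsps):
        print(exon)
```

Exons supported by conservation rise in score and are more likely to be assembled into the final gene, while unsupported exons keep their ab initio score.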
![Page 11: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/11.jpg)
syntenic gene prediction (sgp2)
[Figure: tracks labelled Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]
![Page 12: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/12.jpg)
programs based on mouse human genome sequence comparisons improve gene predictions
program     sensitivity   specificity
genscan            0.79          0.46
twinscan           0.80          0.62
SGP                0.79          0.66
Accuracy on human chromosome 22
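For readers unfamiliar with how these two measures are usually computed in gene-finding evaluations, here is a minimal nucleotide-level sketch. It assumes the convention common in this field, where sensitivity is the fraction of annotated coding positions that are predicted and specificity is the fraction of predicted coding positions that are annotated; the slides do not state which level (nucleotide, exon or gene) the table refers to, and the helper names are hypothetical.

```python
def positions(exons):
    """Set of coding positions covered by (start, end) exons, ends inclusive."""
    return {p for start, end in exons for p in range(start, end + 1)}


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    annotated = positions(annotated_exons)
    predicted = positions(predicted_exons)
    true_pos = len(annotated & predicted)
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy example: one annotated exon, one prediction that is twice as long.
    print(nucleotide_accuracy([(1, 100)], [(1, 200)]))
```

Under this convention, over-prediction leaves sensitivity untouched but lowers specificity, which is the pattern the table shows for genscan relative to the comparative programs.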
![Page 13: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/13.jpg)
how accurate are the sgp predictions: nucleotide level
![Page 14: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/14.jpg)
how accurate are the sgp predictions: exon level
![Page 15: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/15.jpg)
gene prediction programs predict a large number of genes
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity        3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
away from an ensembl:
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417
predictions in the mouse genome
![Page 16: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/16.jpg)
and a large number of novel genes ...
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity        3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
away from an ensembl:
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417
predictions in the mouse genome
![Page 17: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/17.jpg)
...with exons...
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity       10987   12158
                                            3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
away from an ensembl:
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417
predictions in the mouse genome
![Page 18: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/18.jpg)
that look like fine proteins
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity       10987   12158
                                            3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
away from an ensembl:
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417
predictions in the mouse genome
![Page 19: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/19.jpg)
almost every mouse gene has a human orthologue counterpart
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity       10987   12158
                                            3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
predictions in the mouse genome
![Page 20: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/20.jpg)
            |1b
chr1_2213   MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
            **** *:*******************************:************:*** ****
chr1_1808   MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
            |1b

            |2b |3a
chr1_2213   FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
            ** *********************************** ***********.*****:***
chr1_1808   FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
            |2b |3a

chr1_2213   NVLLRGNLPEIEESTDEDVLNISAEECIR
            *:** ***** **.***:.*:***** **
chr1_1808   NLLLWGNLPETEECTDEETLEISAEEYIR
orthologous human mouse genes have conserved exonic structure
![Page 21: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/21.jpg)
orthologous human mouse genes have conserved exonic structure.
• 85% of the orthologous pairs have an identical number of exons
• 91% of the orthologous exons have identical length
• 99.5% of the orthologous exons have identical phase
• there are a few cases of intron insertion/deletion (22)
• U12 introns appear to be strongly conserved between human and mouse
• non-canonical GC-AG introns are less conserved
data on 1506 human/mouse refseq orthologues
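The three conservation figures above (exon number, exon length, exon phase) can be tallied per orthologous pair roughly as follows. This is a sketch under assumptions: each gene is represented as an ordered list of hypothetical `(length, phase)` tuples for its coding exons, and exons are paired by rank only when both genes have the same number of exons. The slides do not say how exons were paired in the original analysis.

```python
def compare_structures(exons_a, exons_b):
    """Compare the exonic structure of two orthologous genes.

    Returns (same_exon_count, exons_compared, same_length, same_phase).
    Exons are compared rank by rank, and only if the exon counts agree.
    """
    same_count = len(exons_a) == len(exons_b)
    pairs = list(zip(exons_a, exons_b)) if same_count else []
    same_length = sum(1 for (len_a, _), (len_b, _) in pairs if len_a == len_b)
    same_phase = sum(1 for (_, ph_a), (_, ph_b) in pairs if ph_a == ph_b)
    return same_count, len(pairs), same_length, same_phase


if __name__ == "__main__":
    human = [(120, 0), (87, 0), (200, 0)]  # toy values, not real data
    mouse = [(120, 0), (90, 0), (200, 0)]
    print(compare_structures(human, mouse))
```

Summing these counts over all pairs and dividing by the totals would give percentages of the kind reported on the slide.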
![Page 22: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/22.jpg)
we will target genes with conserved intron positions
            |2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *  .   ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
            |1a

chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG

            |3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            |2b

            |4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****    .   : :********************** ************ .** . .*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
            |3c
![Page 23: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/23.jpg)
sequence conservation and coding function
![Page 24: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/24.jpg)
orthologous splice sites are more conserved than expected solely from their splicing function
![Page 25: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/25.jpg)
orthologous splice sites are more conserved than expected solely from their splicing function
![Page 26: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/26.jpg)
prediction of splice sites
![Page 27: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/27.jpg)
we will target genes with conserved intron positions
![Page 28: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/28.jpg)
the final pools
pool                                    TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity       10987   12158
                                            3171    4543
human ts / human sgp                         954    2983
  intron aligned                             317    1052
  human ts / human sgp                       637    1931
orphans                                     2217    1560
predictions in the mouse genome
![Page 29: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/29.jpg)
rtpcr: targeting conserved intron positions
            |2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *  .   ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
            |1a

chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG

            |3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            |2b

            |4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****    .   : :********************** ************ .** . .*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
            |3c
![Page 30: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/30.jpg)
rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
pool             predictions   tested   positive   success rate
intron aligned          1428      214        133            62%
similar                 2125       38          4            11%
orphan                  3425       63          2             3%
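The success rates in the table are the positives as a percentage of the tested predictions in each pool, which can be checked directly. The dictionary layout and function name below are just a convenient way to hold the slide's numbers.

```python
# (tested, positive) per pool, copied from the table above
POOLS = {
    "intron aligned": (214, 133),
    "similar": (38, 4),
    "orphan": (63, 2),
}


def success_rate(tested, positive):
    """RT-PCR success rate as a percentage of tested predictions."""
    return 100.0 * positive / tested


if __name__ == "__main__":
    for name, (tested, positive) in POOLS.items():
        print(f"{name}: {round(success_rate(tested, positive))}%")
    # prints 62%, 11% and 3%, matching the slide
```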
![Page 31: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/31.jpg)
rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
![Page 32: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/32.jpg)
about 1000 human genes not in ensembl
• low support by ESTs: 34% match EST sequences
• low representation in other vertebrate genomes: 33% have sequence matches in fish genomes
• restricted expression patterns
![Page 33: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/33.jpg)
restricted expression patterns
![Page 34: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/34.jpg)
![Page 35: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/35.jpg)
![Page 36: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/36.jpg)
Code  Tissues (columns B H K Y V S M L T K E O; dot positions not preserved)  %Id  Homology
3B1  ● ●  38%  Dystrophin-like; with ZZ domain
3B3  ● ● ● ● ●  25%  Novel aquaporin; similar to Drosophila CG12251
3C3  ● ● ● ● ●  25%  TEP1 (telomerase associated); probable ATPase
3C5  ● ●  47%  Voltage-dependent calcium channel gamma subunit
4B3  ● ● ●  34%  Interferon-induced / fragilis transmembrane family
4C6  ● ● ● ● ●  30%  Interleukin 22-binding protein CRF2-10
4G4  ● ● ● ●  64%  Nna1p, nuclear ATP/GTP-binding protein
5B5  ● ● ●  43%  Likely aminophospholipid flippase (transporting ATPase)
1E3  ● ● ● ● ●  40%  N-acetylated-α-linked-acidic dipeptidase (NAALADase)
6C4  ● ●  42%  Not-type homeobox; poss. involved in notochord development
6F5  ● ● ●  66%  Drosophila brain-specific homeobox protein (bsh)
11F2  ● ● ● ● ●  29%  Human GABA-B receptor 2, neurotransmitter release regulator
5A2  ● ● ● ●  41%  Skate liver organic solute transporter beta
11B6  ● ● ●  55%  Interferon-activatable protein 203; nuclear protein
12B3  ● ● ● ● ● ● ● ●  25%  Fatty acid desaturase; maintains membrane integrity
11F6  ● ● ● ● ● ● ●  44%  Rat vanilloid receptor type 1 like protein 1
12E3  ● ●  52%  Fizzy/CDC20; modulates degradation of cell-cycle proteins
12F1  ● ● ● ● ●  43%  Otoferlin (mutated in DFNB9, nonsyndromic deafness)
12H1  ● ● ●  45%  Fruitfly additional sex combs; a Polycomb group protein
12C4  ● ● ●  43%  C. elegans C15C8.2; single-minded-like; HLH and PAS domains
12D2  ●  41%  Cytosolic phospholipase A2, group IVB
12A5  ●  38%  Fruitfly GH15686p; Ent2-like nucleoside transporter
12E5  ● ● ● ●  32%  Relaxin 3 preproprotein; prohormone of the insulin family
11A1  ● ● ● ● ●  89%  Mouse BET3, involved in ER to Golgi transport
11A2  ● ● ● ● ● ●  70%  Vacuolar ATP synthase subunit S1
11B2  ● ● ● ● ● ●  54%  Myosin light chain kinase, skeletal muscle
11G2  36%  Dapper / frodo (transduces Wnt signals by interacting with Dsh)
![Page 37: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/37.jpg)
limitations: sensitivity of the procedure
stage                   twinscan   ensembl    sgp2
initial predictions        48464     23026   48451
multiexonic genes          36831     17565   38979
  25320 (69%)   16368 (94%)   16952 (97%)   21184 (54%)
ortholog pairs             24743             30927
  21099 (85%)   15355 (87%)   16757 (95%)   19831 (64%)
intron aligned             17271             18056
  16337 (94%)   13709 (78%)   15112 (86%)   15977 (88%)
![Page 38: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/38.jpg)
specificity of the prediction can be improved: Ka/Ks ratio
[Figure: two panels, "TS success" and "SGP success", each showing Dn/Ds for outcomes 0 and 1]
![Page 39: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/39.jpg)
further work
• scale the procedure. Try to find rtpcr evidence for (almost) every human gene not yet confirmed
• intronless genes
• human specific gene families (if any)
• genes with non-canonical splicing
![Page 40: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/40.jpg)
selenoproteins
Selenoproteins are proteins that incorporate the amino acid selenocysteine (Sec), the 21st amino acid.
• Function: mostly redox enzymes
• Distribution: 3 domains of life
• Number: 22 families in mammals
![Page 41: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/41.jpg)
selenoproteins
• UGA (STOP) is the codon for Sec
• There is a tRNASec whose anticodon (UCA) pairs with the UGA codon
• Recoding:
1. RNA structure: the SECIS element
2. SECIS binding proteins
![Page 42: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/42.jpg)
selenoproteins
![Page 43: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/43.jpg)
the SECIS element
computational search for selenoproteins
[Figure labels: dSelG, SECIS Pattern]
![Page 44: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/44.jpg)
using geneid to search for selenoproteins
1. Predict SECIS (PatScan)
2. Gene prediction with
   1. TGA in-frame
   2. SECIS
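The two-step idea above (predict SECIS elements, then accept genes with an in-frame TGA only when a SECIS is present) can be sketched on a toy scale. This is not how geneid implements it: the sketch scans plain ATG-to-stop open reading frames on one strand, lets a single in-frame TGA be read as selenocysteine, and assumes the SECIS must start downstream of the final stop codon. The function names are hypothetical, and SECIS positions are taken as given, for example from a prior pattern search.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orf(seq, start, allow_sec):
    """Return (stop_index, sec_positions) for the ORF starting at `start`.

    With allow_sec, the first in-frame TGA is read as Sec instead of stop.
    Returns None when no terminating stop codon is reached.
    """
    sec_positions = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "TGA" and allow_sec and not sec_positions:
            sec_positions.append(i)  # recoded UGA
            continue
        if codon in STOPS:
            return i, sec_positions
    return None


def candidate_selenoprotein_orfs(seq, secis_starts):
    """ORFs with one recoded in-frame TGA and a SECIS after the stop codon."""
    hits = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        found = find_orf(seq, start, allow_sec=True)
        if found is None:
            continue
        stop, secs = found
        if secs and any(s >= stop + 3 for s in secis_starts):
            hits.append((start, stop, secs[0]))
    return hits


if __name__ == "__main__":
    toy = "ATGGCTTGAGCTTAACCCCCC"  # ATG GCT TGA GCT TAA + trailing sequence
    print(candidate_selenoprotein_orfs(toy, secis_starts=[17]))
```

Requiring both signals is what keeps the number of candidates manageable: an in-frame TGA alone would match every ordinary gene that ends in TGA followed by more open frame.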
![Page 45: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/45.jpg)
genome wide search in drosophila
SECIS predicted: 35876
SECIS thermo assessment: 1220
Genes predicted: 12194
Predicted selenoproteins: (4)
Real selenoproteins: 3
![Page 46: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/46.jpg)
dSelG
![Page 47: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/47.jpg)
dSelM
![Page 48: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/48.jpg)
dSelG and dSelM: experimental verification
![Page 49: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/49.jpg)
dSelM has selenoprotein homologues in vertebrates
![Page 50: finding genes by comparing genomes](https://reader035.fdocuments.net/reader035/viewer/2022062807/56814fd8550346895dbd9e0b/html5/thumbnails/50.jpg)
COMPARATIVE GENE PREDICTION
IMIM/UPF/CRG: Genís Parra, Josep F. Abril, Roderic Guigó
University of Geneva: Manolis Dermitzakis, Alexandre Reymond, Robert Lyle, Catherine Ucla, Stylianos Antonarakis
GlaxoSmithKline: Pankaj Agarwal
University of Oxford: Chris Ponting
Washington University: Evan Keibler, Michael Brent
SELENOPROTEINS
IMIM/UPF/CRG: Sergi Castellano
Universitat de Barcelona: Montserrat Corominas, Florenci Serras, Marta Morey, Sergi Bertran
University of Lincoln: Vadim Gladishev, Gregory Kruikov
Harvard University: Marla Berry, Nadia Morozova