Post on 02-Jan-2016
The Organization of a Eukaryotic Gene
GENE: Enhancer - Promoter - Exon 1 - Intron - Exon 2 - Intron - Exon 3 - Intron - Exon 4 - Poly(A) signal
Transcription
mRNA transcript (5’ to 3’): 5’-untranslated region - Exon 1 - Intron - Exon 2 - Intron - Exon 3 - Intron - Exon 4 - 3’-untranslated region
Processing
Mature mRNA (5’ to 3’): 7-mG cap - Exon 1 - Exon 2 - Exon 3 - Exon 4 - (AAAAAA)n, with the start and stop codons marked
Gene identification involves 4 main stages
• Find non-coding features of interest in the sequence
  – CpG islands
  – Tandemly and dispersed repeats
  – Promoter regions (TATA box, cap signal, CCAAT-box)
  – Transcription factors, Poly-A sites
• Find the putative coding region(s) in the sequence
  – Open reading frame
• Determine the exon-intron organization
  – Branch point signal: CT(G,A)A(C,T)
  – 5’ and 3’ splice sites: AG/GUAAGU--------------PyPyPyPyPyPyPyPy-CAG/G
• Identify the gene
  – Motif, signal and pattern
  – Blast, FASTA
  – Functional studies
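As a rough illustration of a site-based search, the branch point signal CT(G,A)A(C,T) from the list above can be written as a regular expression. This is only a sketch: the function name and the example sequence are made up for illustration, and a real gene finder would score such sites rather than simply match them.

```python
import re

# The slide's branch point signal CT(G,A)A(C,T) as a regex.
# The lookahead lets overlapping candidate sites be reported too.
BRANCH_POINT = re.compile(r"(?=(CT[GA]A[CT]))")


def find_branch_points(seq: str) -> list[tuple[int, str]]:
    """Return (0-based position, matched motif) for each candidate site."""
    dna = seq.upper().replace("U", "T")  # accept RNA or DNA, any case
    return [(m.start(), m.group(1)) for m in BRANCH_POINT.finditer(dna)]


# Hypothetical intron fragment: donor, branch point, Py tract, acceptor.
example = "ggtaagtccCTGACttttttttttcag"
print(find_branch_points(example))
```

A five-base pattern like this occurs by chance very often, which is one reason such signals are normally combined with other evidence.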
GENE FINDERS
Banbury Cross http://igs-server.cnrs-mrs.fr/igs/banbury
FGENEH http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID http://www1.imim.es/geneid.html
GeneMachine http://genome.nhgri.nih.gov/genemachine
GeneParser http://beagle.colorado.edu/_eesnyder/GeneParser.htl
GENSCAN http://genes.mit.edu/GENSCAN.html
Genotator http://www.fruitfly.org/_nomi/genotator/
GRAIL http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP http://compbio.ornl.gov/grailexp/
HMMgene http://www.cbs.dtu.dk/services/HMMgene/
MZEF http://www.cshl.org/genefinder
PROCRUSTES http://www-hto.usc.edu/software/procrustes
RepeatMasker http://ftp.genome.washington.edu/RM/RepeatMasker.html
Sputnik http://rast.abajian.com/sputnik/
GENSCAN Web Server at MIT
Gene prediction programs:
           Accuracy per nucleotide   Accuracy per exon
Method     Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN    0.93   0.93   0.91        0.78   0.81   0.80        0.09   0.05
FGENEH     0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID     0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GenePa2    0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang    0.72   0.75   0.69        0.50   0.49   0.50        0.21   0.21
GRAILII    0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.10
SORFIND    0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound     0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13
GENSCAN Performance Data: Accuracy as a Function of Exon Length
Length        Annotated exons                  Predicted exons
range (bp)    No.    %Exact  %Part  %Miss      No.    %Exact  %Part  %Wrong
<= 24         89     38      8      52         44     77      11     11
25 - 49       163    58      15     25         124    76      6      18
50 - 74       248    70      12     16         204    85      9      6
75 - 99       382    85      8      6          389    84      6      10
100 - 124     351    84      9      7          366    81      8      11
125 - 149     425    88      8      4          460    81      10     7
150 - 174     261    88      9      2          283    81      11     7
175 - 199     167    91      7      2          188    81      12     7
200 - 299     353    90      8      1          390    82      8      8
>= 300        211    66      19     1          204    69      20     10
Total         2650   81      10     8          2678   81      10     9
cDNA and genomic DNA alignment and matrix analysis:
GRAIL 2
  10138 - 11018 +
  12608 - 12748 x
  13530 - 13923 x
GENSCAN
  10138 - 11018 +
  11268 - 11341 +
  11450 - 11518 +
  11644 - 11808 +
  11989 - 12144 +
  12360 - 12454 x
  12608 - 12748 x
FGENES
  1880 - 1908 x
  5061 - 5175 x
  5900 - 6049 x
  8317 - 8544 +
  10357 - 11018 +
  11268 - 11341 +
  11450 - 11518 +
  11644 - 11864 +
  polyA: 12521 +
What to do next?
The predictions made by these programs are just that: predictions.
NEVER TRUST A COMPUTER!
Automatic sequencer
One gene: one promoter, one transcript, one protein.
Gene structure: promoter; exons; introns
DNA
RNA
Protein
Simple Mathematics:
Human Genome: 3 x 10^9 bp
Human Genes (1.5% of the genome): 40,000 genes
In a given cell type at a certain stage, it is estimated that around 25 to 50% of the genes are transcribed or expressed: 10,000 to 20,000 genes
  40,000 x 35% x 5~10 splicing = 70,000 ~ 140,000
+ 40,000 x 65%                 = 26,000
= 96,000 ~ 166,000
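The arithmetic above can be reproduced in a few lines. This sketch assumes one reading of the slide: 35% of the 40,000 genes are alternatively spliced into 5 to 10 forms each, and the remaining 65% give one transcript each. The function and constant names are illustrative only.

```python
GENES = 40_000
SPLICED_PERCENT = 35       # share of genes assumed to be alternatively spliced
ISOFORMS = (5, 10)         # assumed low and high number of splice forms per gene


def transcript_range(genes=GENES, spliced_percent=SPLICED_PERCENT, isoforms=ISOFORMS):
    """Return (low, high) estimates of distinct transcripts."""
    spliced = genes * spliced_percent // 100        # 14,000 genes
    unspliced = genes - spliced                     # 26,000 genes, one transcript each
    low, high = (spliced * n + unspliced for n in isoforms)
    return low, high


def expressed_range(genes=GENES, low_percent=25, high_percent=50):
    """Genes expressed in one cell type at one stage (25 to 50% on the slide)."""
    return genes * low_percent // 100, genes * high_percent // 100


print(transcript_range())   # (96000, 166000)
print(expressed_range())    # (10000, 20000)
```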
The subset of genes expressed in a given cell or tissue type such as the prostate may be defined as the transcriptome, the dynamic link between the genome, the proteome, and the cellular phenotype associated with physical characteristics.
Genome: DNA Sequence and Genes
• SNPs
• Splicing variants
Transcriptome: Entire mRNA Complement
• Spatial/Temporal Expression
• Aberrant expression profiles
Proteomics: Entire Protein Complement
• Functional proteomics: profiling
• Structural proteomics: 3-D structure
• Protein interactions: genetic networks
Unknown sequence (http://www.wiley.com/legacy/products/subject/life/bioinformatics/questions_10.html)ATGGAGAATAGTCTTAGATGTGTTTGGGTACCCAAGCTGGCTTTTGTACTCTTCGGAGCTTCCTTGCTCA GCGCGCATCTTCAAGTAACCGGTTTTCAAATTAAAGCTTTCACAGCACTGCGCTTCCTCTCAGAACCTTCTGATGCCGTCACAATGCGGGGAGGAAATGTCCTCCTCGACTGCTCCGCGGAGTCCGACCGAGGAGTTCCAGTGATCAAGTGGAAGAAAGATGGCATTCATCTGGCCTTGGGAATGGATGAAAGGAAGCAGCAACTTTCAAATGGGTCTCTGCTGATACAAAACATACTTCATTCCAGACACCACAAGCCAGATGAGGGACTTTACCAATGTGAGGCATCTTTAGGAGATTCTGGCTCAATTATTAGTCGGACAGCAAAAGTTGCAGTAGCAGGACCACTGAGGTTCCTTTCACAGACAGAATCTGTCACAGCCTTCATGGGAGACACAGTGCTACTCAAGTGTGAAGTCATTGGGGAGCCCATGCCAACAATCCACTGGCAGAAGAACCAACAAGACCTGACTCCAATCCCAGGTGACTCCCGAGTGGTGGTCTTGCCCTCTGGAGCATTGCAGATCAGCCGACTCCAACCGGGGGACATTGGAATTTACCGATGCTCAGCTCGAAATCCAGCCAGCTCAAGAACAGGAAATGAAGCAGAAGTCAGAATTTTATCAGATCCAGGACTGCATAGACAGCTGTATTTTCTGCAAAGACCATCCAATGTAGTAGCCATTGAAGGAAAAGATGCTGTCCTGGAATGTTGTGTTTCTGGCTATCCTCCACCAAGTTTTACCTGGTTACGAGGCGAGGAAGTCATCCAACTCAGGTCTAAAAAGTATTCTTTATTGGGTGGAAGCAACTTGCTTATCTCCAATGTGACAGATGATGACAGTGGAATGTATACCTGTGTTGTCACATATAAAAATGAGAATATTAGTGCCTCTGCAGAGCTCACAGTCTTGGTTCCGCCATGGTTTTTAAATCATCCTTCCAACCTGTATGCCTATGAAAGCATGGATATTGAGTTTGAATGTACAGTCTCTGGAAAGCCTGTGCCCACTGTGAATTGGATGAAGAATGGAGATGTGGTCATTCCTAGTGATTATTTTCAGATAGTGGGAGGAAGCAACTTACGGATACTTGGGGTGGTGAAGTCAGATGAAGGCTTTTATCAATGTGTGGCTGAAAATGAGGCTGGAAATGCCCAGACCAGTGCACAGCTCATTGTCCCTAAGCCTGCAATCCCAAGCTCCAGTGTCCTCCCTTCGGCTCCCAGAGATGTGGTCCCTGTCTTGGTTTCCAGCCGATTTGTCCGTCTCAGCTGGCGCCCACCTGCAGAAGCGAAAGGGAACATTCAAACTTTCACGGTCTTTTTCTCCAGAGAAGGTGACAACAGGGAACGAGCATTGAATACAACACAGCCTGGGTCCCTTCAGCTCACTGTGGGAAACCTGAAGCCAGAAGCCATGTACACCTTTCGAGTTGTGGCTTACAATGAATGGGGACCGGGAGAGAGTTCTCAACCCATCAAGGTGGCCACACAGCCTGAGTTGCAAGTTCCAGGGCCAGTAGAAAACCTGCAAGCTGTATCTACCTCACCTACCTCAATTCTTATTACCTGGGAACCCCCTGCCTATGCAAACGGTCCAGTCCAAGGTTACAGATTGTTCTGCACTGAGGTGTCCACAGGAAAAGAACAGAATATAGAGGTTGATGGACTATCTTATAAACTGGAAGGCCTGAAAAAATTCACCGAATATAGTCTTCGATTCTTAGCTTATAATCGCTATGGTCCGGGCGTCTCTACTGATGATATAACAGTGGTTACACTTTCTGACGTGCCAAGTGCCCCGCCTCAGAACGTCTCCCTGGAAGT
GGTCAATTCAAGAAGTATCAAAGTTAGCTGGCTGCCTCCTCCATCAGGAACACAAAATGGATTTATTACCGGCTATAAAATTCGACACAGAAAGACGACCCGCAGGGGTGAGATGGAAACACTGGAGCCAAACAACCTCTGGTACCTATTCACAGGACTGGAGAAAGGAAGTCAGTACAGTTTCCAGGTGTCAGCCATGACA
TATA box
http://sullivan.bu.edu/~mfrith/HPD.html
http://www.epd.isb-sib.ch/
http://transfac.gbf.de/
http://binfo.ym.edu.tw/passdb/index.html
http://cgsigma.cshl.org/new_alt_exon_db2/
http://www.bioinformatics.ucla.edu/HASDB/
“Briefly, gene-finding strategies can be grouped into three major categories. Content-based methods rely on the overall, bulk properties of a sequence in making a determination. Characteristics considered here include how often particular codons are used, the periodicity of repeats, and the compositional complexity of the sequence. Because different organisms use synonymous codons with different frequency, such clues can provide insight into determining regions that are more likely to be exons. In site-based methods, the focus turns to the presence or absence of a specific sequence, pattern, or consensus. These methods are used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, and start and stop codons. Finally, comparative methods make determinations based on sequence homology. Here, translated sequences are subjected to database searches against protein sequences (cf. Chapter 8) to determine whether a previously characterized coding region corresponds to a region in the query sequence.”
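To make the content-based idea concrete, here is a minimal sketch of one such measure: the relative frequency of each codon in a candidate coding sequence. The function name and the toy sequence are illustrative assumptions. A real content sensor would compare these frequencies against a species-specific codon usage table.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon, reading in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3            # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy sequence, not from the slides: ATG GCT GCT TAA
print(codon_usage("ATGGCTGCTTAA"))
```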
Ab Initio Gene Discovery
Protein coding sequences within a whole genome sequence can be identified using a process known as ab initio gene discovery, in which software that recognizes features common to protein coding transcripts is used.
– The existence of long open reading frames
– Particularly ones for which the codon bias is typical of that observed for the species being studied
– Proximity of transcriptional and translational initiation motifs and 3’ polyadenylation site
– Splicing consensus sequences at putative intron-exon boundaries
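The first feature in the list, long open reading frames, is the easiest to sketch. The code below scans the three forward frames for ATG-to-stop stretches. It is a simplified illustration, not how any of the programs named in these slides work: it ignores the reverse strand, introns and codon bias, and the function name and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 30) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons from ATG up to, not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None                     # look for the next ATG
    return orfs


# Toy example: ATG AAA TTT TAG sits in frame 2.
print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```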
Hidden Markov Models
An example of the HMM
Sequence:      TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA
Hidden states: NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRRNNN
N = Normal region, R = GC-rich region

     A     T     G     C     Mean length
-----------------------------------------
N    0.3   0.3   0.2   0.2   10
R    0.1   0.1   0.4   0.4   5
Hidden Markov Models (cont.)
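Pr(sequence) = Σ Pr(sequence|hidden states) Pr(hidden states), summed over all hidden-state paths
Pr(TGCC) = Pr(TGCC|NNNN) Pr(NNNN)
         + Pr(TGCC|NNNR) Pr(NNNR)
         + Pr(TGCC|NNRN) Pr(NNRN)
         + Pr(TGCC|NRNN) Pr(NRNN)
         + Pr(TGCC|NNRR) Pr(NNRR)
         + Pr(TGCC|NRNR) Pr(NRNR)
         + Pr(TGCC|NRRN) Pr(NRRN)
         + Pr(TGCC|NRRR) Pr(NRRR)
(all eight paths listed begin in state N)
Pr(TGCC|NNNN) Pr(NNNN)
= Pr(T|N)Pr(G|N)Pr(C|N)Pr(C|N) × Pr(N-N)Pr(N-N)Pr(N-N)
= (0.3×0.2×0.2×0.2) × (0.9×0.9×0.9) = 0.00175
Here Pr(N-N) = 0.9 is the probability of staying in state N, consistent with a mean N-region length of 10 (1 - 1/10).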
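The sum over hidden-state paths can be checked numerically. The sketch below uses the emission table from the example and takes Pr(N-N) = 0.9 from the worked term. The value Pr(R-R) = 0.8 is an assumption, derived the same way from the mean R-region length of 5, and the path is assumed to start in N as in the slide's enumeration. The forward algorithm is included because it gives the same total without listing every path.

```python
from itertools import product

EMIT = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
    "R": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
MEAN_LENGTH = {"N": 10, "R": 5}
# Assumption: region lengths are geometric, so Pr(stay) = 1 - 1/mean length.
STAY = {state: 1 - 1 / length for state, length in MEAN_LENGTH.items()}


def trans(a: str, b: str) -> float:
    """Transition probability between the two states."""
    return STAY[a] if a == b else 1 - STAY[a]


def path_probability(seq: str, path: str) -> float:
    """Pr(seq | path) * Pr(path), with the first state taken as given."""
    p = EMIT[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans(path[i - 1], path[i]) * EMIT[path[i]][seq[i]]
    return p


def sequence_probability(seq: str, first_state: str = "N") -> float:
    """Brute-force sum over every hidden-state path starting in first_state."""
    return sum(
        path_probability(seq, first_state + "".join(rest))
        for rest in product("NR", repeat=len(seq) - 1)
    )


def forward_probability(seq: str, first_state: str = "N") -> float:
    """Same quantity via the forward algorithm, linear in sequence length."""
    alpha = {s: (EMIT[s][seq[0]] if s == first_state else 0.0) for s in EMIT}
    for base in seq[1:]:
        alpha = {
            t: sum(alpha[s] * trans(s, t) for s in EMIT) * EMIT[t][base]
            for t in EMIT
        }
    return sum(alpha.values())


print(path_probability("TGCC", "NNNN"))   # the worked term, about 0.00175
print(sequence_probability("TGCC"), forward_probability("TGCC"))
```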
http://genes.mit.edu/GENSCAN.html
http://compbio.ornl.gov/grailexp/
Predicts   GenScan        GRAIL          MZEF           HMMgene
1          10138-11018    10138-11022    11464-11518    10138-11018
2          11268-11341    12608-12711    12024-12079    11268-11341
3          11450-11518    13530-13923    13530-13892    11450-11518
4          11644-11808    15698-15771    14980-15052    11644-11808
5          11989-12144    16358-16532    16358-16451    15002-15052
http://research.nhgri.nih.gov/genemachine/
http://genome.ucsc.edu/index.html
Function                          Command        GCG   SeqWEB
Sequence manipulation             Reverse        +     +
ORF Searching                     Frames         +     +
                                  Map            +     +
                                  Translate      +     +
Mapping (restriction sites)       Map            +     +
                                  (-minc)        +     +
                                  (-maxc)        +     +
                                  Mapsort        +     +
                                  (-exclude)     +     -
                                  (-digest)      +     -
                                  Mapplot        +     +
Mapping (transcription factors)   Map tfsites    +     -
Programs used in this exercise:
(1) Sequence manipulation – reverse
(3) ORF Searching – frames, map, translate
(4) Mapping (restriction sites) – map (-minc, -maxc), mapsort (-exclude, -digest), mapplot, plasmidmap
(5) Mapping (transcription factor) – map (tfsites).
Sequences used in this exercise:
gb:z18853 (C. elegans mRNA for capping protein alpha subunit.) cds:10-858
gb:x03795 (Human mRNA for platelet derived growth factor A-chain, PDGF-A) cds:388-1020.
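Outside GCG, the same basic manipulations are easy to sketch in Python. The helpers below are illustrative stand-ins, not the GCG programs themselves. They assume the cds coordinates are 1-based and inclusive, as in GenBank feature tables.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Reverse-complement a DNA string (the job done by 'reverse' above)."""
    return dna.translate(COMPLEMENT)[::-1]


def cds_slice(mrna: str, start: int, end: int) -> str:
    """Extract a CDS given 1-based inclusive coordinates."""
    return mrna[start - 1:end]


def codon_count(start: int, end: int) -> int:
    """Number of codons spanned by a CDS; raises if it is not a multiple of 3."""
    length = end - start + 1
    if length % 3:
        raise ValueError(f"CDS length {length} is not a multiple of 3")
    return length // 3


# Coordinates from the exercise; both spans are whole numbers of codons.
print(codon_count(10, 858))     # z18853
print(codon_count(388, 1020))   # x03795
```

If the annotated span includes the stop codon, the encoded protein is one residue shorter than the codon count.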
Exercise89-10
Gene Expression Studies
EST: Expressed Sequence Tags
dbEST is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.
In silico cloning: To perform an electronic cDNA library screen, the EST sequences retrieved in this way can be used as queries in a BLASTN search of dbEST to identify overlapping ESTs. This procedure can be repeated with the newly identified ESTs until no additional hits are found. The ESTs isolated can then be assembled into sequence contigs using computer software, such as UniGene.
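The repeat-until-no-new-hits loop can be sketched as follows. To keep the example self-contained, the BLASTN search is replaced by a toy test for a shared exact substring. The function names, the `min_overlap` value and the sequences are all hypothetical.

```python
def overlaps(a: str, b: str, min_overlap: int = 8) -> bool:
    """Toy stand-in for a BLASTN hit: a and b share an exact k-mer."""
    kmers = {a[i:i + min_overlap] for i in range(len(a) - min_overlap + 1)}
    return any(
        b[i:i + min_overlap] in kmers for i in range(len(b) - min_overlap + 1)
    )


def in_silico_walk(query: str, est_db: dict[str, str], min_overlap: int = 8) -> list[str]:
    """Collect ESTs reachable from the query through chains of overlaps."""
    found: dict[str, str] = {}
    frontier = [query]
    while frontier:                               # stop when no new hits appear
        probe = frontier.pop()
        for name, seq in est_db.items():
            if name not in found and overlaps(probe, seq, min_overlap):
                found[name] = seq
                frontier.append(seq)              # new EST becomes a query
    return sorted(found)


# Made-up 40-base transcript cut into three overlapping "ESTs".
mrna = "ATGGCGTACCTGAAGTTCGACCATGGTCTAGCAAGGCTTA"
est_db = {
    "EST1": mrna[0:20],
    "EST2": mrna[10:30],
    "EST3": mrna[20:40],
    "unrelated": "T" * 20,
}
print(in_silico_walk(mrna[0:12], est_db))
```

Only EST1 overlaps the initial query. EST2 and EST3 are reached in later rounds, which is the point of iterating.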
Query
EST 2EST 1EST 3
Full length mRNA sequences
dbEST contains many sequencing-related errors.
EST 2
EST 1EST 3
C. elegans a.a. sequences
Human EST sequences
CGI-Comparative Gene Identification
Ortholog: Homologous genes that have diverged from each other after speciation events (e.g., human beta-globin and chimp beta-globin)
Genomic sequence of the nematode C. elegans: A platform for investigating biology
The C. elegans Sequencing Consortium
97 Mb
257 YACs (20% only in YAC)
2527 cosmids
113 fosmids
44 PCR
19,099 predicted genes
18,891 proteins here (16,260 reviewed)
EST: 67,815 ESTs from 40,379 clones
7432 genes
A multicellular organism genome
Read protein sequences from dataset (e.g., C. elegans proteome)
Perform tblastn search against HGI or EST databases
Parse BLAST results and store them in an Oracle database
Rules-based neural network
Predictions: Known genes, Gene Family, New Genes, No match
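The parsing step might look like the sketch below. It assumes the search results were exported in BLAST's 12-column tab-separated format, which the slides do not specify. The E-value cutoff, the function name and the sample rows are invented for illustration, and the classification rules of the original pipeline are not reproduced.

```python
import csv
import io

# Column order of BLAST's standard 12-column tabular output.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text: str, max_evalue: float = 1e-10) -> dict[str, dict]:
    """Keep, for each query, the highest-scoring hit below the E-value cutoff."""
    best: dict[str, dict] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                              # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


# Made-up rows: identifiers borrowed from the slides, numbers invented.
rows = [
    ["U58746", "THC195737", "44.0", "310", "150", "6", "1", "310", "1", "903", "1e-60", "230"],
    ["U58746", "THC000001", "30.0", "80", "56", "0", "5", "84", "10", "249", "2e-12", "70"],
    ["U50199", "THC171302", "50.0", "470", "230", "4", "1", "470", "1", "1400", "0.5", "30"],
]
sample = "\n".join("\t".join(r) for r in rows)
print({q: h["sseqid"] for q, h in best_hits(sample).items()})
```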
[THC195737---------------------------------------------
MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRNPVISPTGYIF
--------]
DREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFTKRTQFSAIESTPSRTGAVAT
[THC195737--------------------
PRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNMKGDKSTSLPSFWIPELNPTAVATKLEKPSS
----------------------------------------------------]
KVLCPVSGKPIKLKELLEVKFTPMPGTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVV
[THC195737----------------------]
EKLIKGDGIDPINGEPMSEDDIIELQRGGTGYSATNETKAKLIRPQLELQ*
U58746
*nucleotide sequence error
Translation of 1   MTRHGKNCTAGAVYTYHEKKKDTAASGYGTQNIRLSRDAVKDFDCCCLSLQPCHD 55
U58746         1   MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRN 55
                   *******.** .******...*. ****** . ** *..*.* **.*.****.

Translation of 56  PVVTPDGYLYEREAILEYILHQKKEIARQMKAYEKQRGTRREEQKELQRAASQDH 110
U58746         56  PVISPTGYIFDREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFT 110
                   **..* **...****** ** *** *...* **** * . *

Translation of 111 VRGFLEKESAIVSRPLNPFTAKALSGTSPD-----------DVQPGPSVGPPSKD 154
U58746         111 KRTQFSAIESTPSRTGAVATPRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNM 165
                   * . ** * . *. *. * *

Translation of 155 K-DK--VLPSFWIPSLTPEAKATKLEKPSRTVTCPMSGKPLRMSDLTPVHFTPLD 206
U58746         166 KGDKSTSLPSFWIPELNPTAVATKLEKPSSKVLCPVSGKPIKLKELLEVKFTPMP 220
                   * ** ******* *.* * ******** * **.****... .* *.***.

Translation of 207 SSVDRVGLITRSER-YVCAVTRDSLSNATPCAVLRPSGAVVTLECVEKLIRKDMV 260
U58746         221 ------GTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVVEKLIKGDGI 269
                   * * . * ..* **** *.*.* ** *. * .** . *****. * .

Translation of 261 DPVTGDKLTDRDIIVLQRGGTGFAGSGVKLQAEKSRPVMQA 301
U58746         270 DPINGEPMSEDDIIELQRGGTGYSAT-NETKAKLIRPQLELQ 310
                   **..*. ... *** *******.. . .* ** ..
(44%/59%)
[THC171302--MVFGENQDLIRTHFQKEADKVRAMKTNWGLFTRTRMIAQSDYDFIVTYQQAENEAERSTVLSVFKEK-------------------------------------------------------------------AVYAFVHLMSQISKDDYVRYTLTLIDDMLREDVTRTIIFEDVAVLLKRSPFSFFMGLLHRQDQYIVH-------------------------------------------------------------------ITFSILTKMAVFGNIKLSGDELDYCMGSLKEAMNRGTNNDYIVTAVRCMQTLFRFDPYRVSFVNING-------------------------------------------------------------------YDSLTHALYSTRKCGFQIQYQIIFCMWLLTFNGHAAEVALSGNLIQTISGILGNCQKEKVIRIVVST-----------------] [THC177150--------------------------------------------LRNLITSNQDVYMKKQAALQMIQNRIPTKLDHLENRKFTDVDLVEDMVYLQTELKKVVQVLTSFDEY-------------------------------------------------------------------ENELRQGSLHWSPAHKCEVFWNENAHRLNDNRQELLKLLVAMLEKSNDPLVLCVAAHDIGEFVRYYP------------------------------------------------]RGKLKVEQLGGKEAMMRLLTVKDPNVRYHALLAAQKLMINNWKDLGLEI
U50199
gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha is... 927 0.0
gi|2895576 (AF041337) vacuolar proton pump subunit SFD beta iso... 885 0.0
gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; code... 468 e-131
gi|1086810 (U41109) similar to S. cerevisiae vacular H(+)-ATPas... 335 5e-91
gnl|PID|e351278 (Z99532) hypothetical protein [Schizosaccharomy... 185 5e-46
sp|P41807|VM13_YEAST VACUOLAR ATP SYNTHASE 54 KD SUBUNIT (V-ATP... 123 2e-27
gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; coded for by C. elegans cDNA cm7g5; coded for by C. elegans cDNA cm14b9; coded for by C. elegans cDNA yk52g5.5; coded for by C. elegans cDNA yk76e5.5; coded for by C. elegans cDNA yk131f11.5; c... Length = 470 Score = 468 bits (1192), Expect = e-131 Identities = 243/477 (50%), Positives = 314/477 (64%), Gaps = 20/477 (4%)
Human gene: 483 aa
gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha isoform [Bos taurus] Length = 483 Score = 927 bits (2369), Expect = 0.0 Identities = 460/483 (95%), Positives = 465/483 (96%)
Query: 1   MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISAEDCEFIQRFEMKRSPE 60
           MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMIS+EDCEFIQRFEMKRSPE
Sbjct: 1   MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISSEDCEFIQRFEMKRSPE 60

Query: 61  EKQEMLQTEGSQCAKTFINLMTHICKEQTVQYILTMVDDMLQENHQRVSIFFDYARCSKN 120
           EKQEMLQTEGSQ AKTFINLMTHI KEQTVQYILT+VDD LQENHQRVSIFFDYA+ SKN
Sbjct: 61  EKQEMLQTEGSQRAKTFINLMTHISKEQTVQYILTLVDDTLQENHQRVSIFFDYAKRSKN 120

Query: 121 TAWPYFLPILNRQDPFTVHMAARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180
           TAW YFLP+LNRQD FTVHM ARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS
Sbjct: 121 TAWSYFLPMLNRQDLFTVHMTARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180

Query: 181 GVAVETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240
           GV  ETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ
Sbjct: 181 GVTAETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240

Query: 241 YQMIFSIWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSTERE 300
           YQMIFS+WLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKS ERE
Sbjct: 241 YQMIFSVWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSVERE 300

Query: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360
           TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK
Sbjct: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360

Query: 361 SGRLEWSPVHKSEKFWRENAVRLNEKNYELLKILTKLLEVSDDPQXLAVAAHDVGXYVRX 420
           SGRLEWSPVHKSEKFWREN  RLNEKNYELLKILTKLLEVSDDPQ LAVAAHDVG YVR
Sbjct: 361 SGRLEWSPVHKSEKFWRENPARLNEKNYELLKILTKLLEVSDDPQVLAVAAHDVGEYVRH 420

Query: 421 YPRGKRVIEQXGGKQLVMNHMHHEXQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTXA 480
           YPRGKRVIEQ GGKQLVMNHMHHE QQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQT A
Sbjct: 421 YPRGKRVIEQLGGKQLVMNHMHHEDQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTAA 480

Query: 481 ARS 483
           ARS
Sbjct: 481 ARS 483
[AA134689-----------------------------------------------MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGYSYCGETAAYAF--------------------------]KQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDHKINEELRATRHFDLMDRRDEES [THC196496-------------------------------------EHSIEMQLPFIAKVMGSKRYTIVPVLVGSLPGSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERF------------------------------------------------------------------SFSPYDRHSSIPIYEQITNMDKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRIS-----------------------------------]NNHTHEFRFLHYTQSNKVRSSVDSSVSYASGVLFVHPN
U64857
Translation of 1   MSNR---VVCREASHAGSWYTASGPQLNAQLEGWLSQVQSTKRPARAIIAPHAGY 52
U64857         1   MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGY 55
                   ** .* ********.* * ** ** . ***.*.*****

Translation of 53  TYCGSCAAHAYKQVDPSITRRIFILGPSHHVPLSRCALSSVDIYRTPLYDLRIDQ 107
U64857         56  SYCGETAAYAFKQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDH 110
                   .*** .** *.*** * *.******* * * **... ***** ** .*.

Translation of 108 KIYGELWKTGMFERMSLQTDEDEHSIEMHLPYTAKAMESHKDEFTIIPVLVGALS 162
U64857         111 KINEELRATRHFDLMDRRDEESEHSIEMQLPFIAKVMGSKR--YTIVPVLVGSLP 163
                   ** ** * *. * . .* ******.**. ** * *.. .**.*****.*

Translation of 163 ESKEQEFGKLFSKYLADPSNLFVVSSDFCHWGQRFRYSYYD-ESQGEIYRSIEHL 216
U64857         164 GSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERFSFSPYDRHSSIPIYEQITNM 218
                   *..* .* .*..*. ** ****.********.** .* ** * ** * ..

Translation of 217 DKMGMSIIEQLDPVSFSNYLKKYHNTICGRHPIGVLLNAITELQK-NGMNMSFSF 270
U64857         219 DKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRISNNHTHEFRF 273
                   ** *** ** * * .* **** .******.** ..*.* . *. . * *

Translation of 271 LNYAQSSQCRNWQDSSVSYAAGALTVH 297
U64857         274 LHYTQSNKVRSSVDSSVSYASGVLFVHPN 302
                   *.*.** . * *******.* * **
[THC132858-------------------]MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGTTSSQRVHTM
LTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKFTLQKTEWDSIDLERLNLA
[THC85433------------------------------------------LDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDMTIPRKRKGFTSQHEKGLEKFYEAVSTA--------------------------------------------] {AA938998*****************FMRHVNLQVVKCVIVASRGFVKDAFMQHLIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEV*******} [THC200182----------------------------------------------------LETPQVALRLADTKAQGEVKALNQFLELMSTEPDRAFYGFNHVNRANQELAIETLLVADSLFRA-----------------------------------------------]QDIETRRKYVRLVESVREQNGKVHIFSSMHVSGEQLAQLTGCAAILRFPMPDLDDEPMDEN
Z36238
Translation of 1   MKLVRKNIEKDNAGQVTLVPEEPEDMWHTYNLVQVGDSLRASTIRKVQTESSTGS 55
Z36238         1   MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGT 55
                   ** ...**.*..* * *. ** ***** ***...** ..******* .*.***.

Translation of 56  VGSNRVRTTLTLCVEAIDFDSQACQLRVKGTNIQENEYVKMGAYHTIELEPNRQF 110
Z36238         56  TSSQRVHTMLTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKF 110
                   *.**.* **..**.**** * .*..** **.**. **.******.*****.*

Translation of 111 TLAKKQWDSVVLERIEQACDPAWSADVAAVVMQEGLAHICLVTPSMTLTRAKVEV 165
Z36238         111 TLQKTEWDSIDLERLNLALDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDM 165
                   ** * .***. ***. * *** .*******..****..**.**.*******...

Translation of 166 NIPRKRKGNCSQHDRALERFYEQVVQAIQRHIHFDVVKCILVASPGFVREQFCDY 220
Z36238         166 TIPRKRKGFTSQHEKGLEKFYEAVSTAFMRHVNLQVVKCVIVASRGFVKDAFMQH 220
                   .******* .***.. **.*** * * **.. ****..*** ***.. *

Translation of 221 MFQQAVKTDNKLLLGNRSKFLQVHASSGHKYSLKEALCDPTVLARLSDTKAAGEV 275
Z36238         221 LIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEVLETPQVALRLADTKAQGEV 275
                   . .* . * .*.**. *.*** * .*** * * * **.**** ***
sp|P48612|PELO_DROME PELOTA PROTEIN >gi|973224 (U27197) pelota ... 520 e-147
sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHRO... 446 e-125
gi|3941543 (AF069497) pelota [Arabidopsis thaliana] 385 e-106
pir||S45456 DOM34 protein - yeast (Saccharomyces cerevisiae) >g... 236 2e-61
sp|P33309|DO34_YEAST DOM34 PROTEIN >gi|295608 (L11277) DOM34 [S... 212 2e-54
gnl|PID|e304505 (Z86109) unknown [Saccharomyces pastorianus] 199 3e-50
gi|2622770 (AE000923) cell division protein [Methanobacterium t... 155 4e-37
gnl|PID|d1031529 (AP000006) 356aa long hypothetical protein [Py... 146 3e-34
sp|Q57638|Y174_METJA HYPOTHETICAL PROTEIN MJ0174 >gi|2127805|pi... 145 6e-34
gi|2649765 (AE001046) cell division protein pelota (pelA) [Arch... 116 3e-25
sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHROMOSOME III >gi|3879163|gnl|PID|e1348805 (Z36238) Similar to the DOM34 protein of saccharomyces cerevisiae (Swiss Prot accession number P33309) [Caenorhabditis elegans] Length = 381 Score = 446 bits (1136), Expect = e-125 Identities = 215/371 (57%), Positives = 282/371 (75%)
BLASTP (Jan. 10, 1999)
C. elegans from WormPept: 18,452 entries, HGI searches
* Families 3,934
* Known Gene 7,954
* New Contig 3,456
* Undetermined 2,070
  <100 aa 1,038
* 150 full length genes so far, more expected following GAP closure and 5’ RACE.
* 110 CGI genes were included in human reference genes.
83% between Human & C. elegans
11% C. elegans specific
C. elegans from WormPept: 18,452 entries, MGI searches
* Novel Genes 11,407
* Known Genes 4,151
* Undetermined 1,856
  Short peptide 1,038
84% between Mouse & C. elegans
10% C. elegans specific
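The slides do not say how the percentages were derived. One reading that reproduces them treats every entry outside "Undetermined" and the short-peptide class as shared, and "Undetermined" as species specific. The sketch below checks that reading against the counts. The dictionary and function names are illustrative.

```python
HGI = {
    "Families": 3_934, "Known Gene": 7_954, "New Contig": 3_456,
    "Undetermined": 2_070, "<100 aa": 1_038,
}
MGI = {
    "Novel Genes": 11_407, "Known Genes": 4_151,
    "Undetermined": 1_856, "Short peptide": 1_038,
}


def summarize(counts: dict[str, int], short_key: str) -> tuple[int, int, int]:
    """Return (total entries, % shared, % species specific), rounded."""
    total = sum(counts.values())
    unmatched = counts["Undetermined"] + counts[short_key]
    shared = round(100 * (total - unmatched) / total)
    specific = round(100 * counts["Undetermined"] / total)
    return total, shared, specific


print(summarize(HGI, "<100 aa"))         # (18452, 83, 11)
print(summarize(MGI, "Short peptide"))   # (18452, 84, 10)
```

Both category lists add up to the 18,452 WormPept entries quoted on the slides.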
[Figure: scatter plot of C. elegans protein length (axis 0 to 1200) against Human CGI protein length (axis 0 to 1000)]
Successful GAP-closure results were obtained on 11/12 novel genes
C. elegans from FlyDB: 13,146 entries, HGI release 4.0 searches
Expect e=10
* Known Gene 8,053
* Families 3,560
* New Contig 1,315
* Undetermined 218
  <100 aa 564
94% between Human & Drosophila
2% Drosophila specific
March 24, 2000 - Science
Expect e=10^-10
60% between Human & Drosophila
Table 1. The exon-intron junction sequences of the human crooked neck gene
Exon      5’ sequence                        3’ sequence
1 (EX3)   —                                  GTTGACAAAAACAgtaggtcacaaaatgatt
2         gttcttactcccttgcagGGCGCCACGGCTC    CAAAGTGGCCAAGgtaggcgatcgcgagggg
3         tgtcttctttcttaaaagGTGAAAAACAAAG    AAGGAAAAGGAAGgtcagtcagtgtggtatc
4         ctttttgccttcttccagACTTTTGAAGATA    AGGAGATTCAAAGgtaaaattactgagagtg
5         caaatcttgcttcttaagGGCTCGATCCATA    TTAATCAGTTCTGgtaagtttctgatctaac
6         cgggtgattttgttacagGTACAAGTACACG    ATTTATGAGCGATatatcctttggacgagat
7         tccttaattcccttgcacTTGTCCTCGTGCA    AAATCAGAAAGAGgtaagtatacctaacttc
8         catcctgtctgcatttagTTTGAAAGGGTAC    AGAAGAAGTGAAGgtgagcactggtgtggat
9         aactggttgtctttctagGCGAATCCACACA    ATTGGAGGCAAAGgtgaaaaaacagaattat
10        tttcttatcctaatacagGATCCTGAGAGGA    TCCTCACAAAAAGgtatgtttgctctaaagt
11        tgtgcttttgttttctagTTCACATTTGCCA    CAGAAGAGCATTGgtaagtaaagaaaggatc
12        ggttattttattttccagGGAACTTCCATAG    AGACATGCCAGAGgtgagcatctcaagtcaa
13        tttacttttcttccttagGTGCTTTGGAAAT    GCAGCATGTCAAGgtatccttgctttgtaga
14        taataactttttttaaagGTATGGATCAGCT    GACTGATGATGGGgtaagaactctgccctgg
15        cattttttatttctgtagTCTGATGCAGGCT    —
Note. Exon sequence is shown in uppercase letters, and intron sequence is shown in lowercase letters. Exon 2 was used as the translation initiation exon in most cells examined.
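Because Table 1 marks exon bases in uppercase and intron bases in lowercase, the splice-site dinucleotides can be read off by case alone. The sketch below does this for exons 5 to 8, with the junction strings copied from the table. The function names are illustrative.

```python
from itertools import dropwhile, takewhile

# (exon number, 5' junction, 3' junction), copied from Table 1, exons 5 to 8.
JUNCTIONS = [
    (5, "caaatcttgcttcttaagGGCTCGATCCATA", "TTAATCAGTTCTGgtaagtttctgatctaac"),
    (6, "cgggtgattttgttacagGTACAAGTACACG", "ATTTATGAGCGATatatcctttggacgagat"),
    (7, "tccttaattcccttgcacTTGTCCTCGTGCA", "AAATCAGAAAGAGgtaagtatacctaacttc"),
    (8, "catcctgtctgcatttagTTTGAAAGGGTAC", "AGAAGAAGTGAAGgtgagcactggtgtggat"),
]


def acceptor(five_prime: str) -> str:
    """Last two intron (lowercase) bases before the exon starts."""
    intron = "".join(takewhile(str.islower, five_prime))
    return intron[-2:]


def donor(three_prime: str) -> str:
    """First two intron (lowercase) bases after the exon ends."""
    intron = "".join(dropwhile(str.isupper, three_prime))
    return intron[:2]


def intron_ends(junctions):
    """Pair each exon's donor with the next exon's acceptor."""
    return [
        (donor(junctions[i][2]), acceptor(junctions[i + 1][1]))
        for i in range(len(junctions) - 1)
    ]


print(intron_ends(JUNCTIONS))
```

For these rows the intron between exons 6 and 7 reads at...ac, while the introns on either side follow the usual gt...ag pattern.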
(A)30
(A)15
ESTs: AI001718, AA157950, AI800228, AI336977, AI814570, AI968687, AI924865, AW004018
ESTs: AA458635, AI049780, AW014044, AA665047, AI003171, AW512558
ESTs: AI018501, AW952970, AA825980, AA195126
(A)10
Type I isoform
Type II isoform
exon 2
exon 2
5’ sequence EXON 2intron gttcttactcccttgcag GGCGCCACGGCTC
SNP
Single nucleotide polymorphism: DNA single-base variations found in more than 1% of the population.
Why are SNPs Important?
• Most common form of genetic variation
  - Genetic linkage studies
  - Genome-wide association studies
• Indicate predisposition to
  - Disease predisposition and onset
  - Drug tolerance
  - Drug efficacy
• Genome SNP scans will uncover gene function, and define new drug targets
• SNPs will enable physicians to personalize therapy
Human Variations
0.3%
2%
Responder
Non-responder
Toxic responder (adverse drug rxn)
SNP
                           Celera-PFP       TSC            Kwok
#SNP                       2,104,820        585,811        438,032
RefHuman cSNP (missense)   2,525 (0.12%)    613 (0.14%)    995 (0.17%)
non-conservative           1,187            251            398
Only a few cSNPs??
Low frequency functional variants are needed in human disease gene discovery
EST based SNP discovery:
Multiple EST entries (>10) and Phred scores were required for SNP discovery.
Human reference protein sequences: 9848.
dbEST dataset: 2,208,221.
NP number | Gene name | a.a. | cSNP
NP_002664 | PLEXIN B1 [HOMO SAPIENS]. | 2135 | 3
NP_006856 | LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR; SUBFAMILY A (WITHOUT TM | 439 | 3
NP_006831 | LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR; SUBFAMILY B (WITH TM AND | 590 | 3
NP_005755 | EPITHELIAL PROTEIN UP-REGULATED IN CARCINOMA; MEMBRANE ASSOCIATED | 114 | 3
NP_006411 | BREFELDIN A-INHIBITED GUANINE NUCLEOTIDE-EXCHANGE PROTEIN 2 [HOMO | 1785 | 3
NP_004273 | BCL2-ASSOCIATED ATHANOGENE 2; BAG-FAMILY MOLECULAR CHAPERONE | 211 | 3
NP_003772 | UDP-GAL:BETAGLCNAC BETA 1;3-GALACTOSYLTRANSFERASE; POLYPEPTIDE 3 | 331 | 3
NP_006854 | LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR; SUBFAMILY A (WITH TM | 489 | 3
NP_003881 | IGG FC BINDING PROTEIN [HOMO SAPIENS]. | 5404 | 3
NP_031385 | MLN51 PROTEIN [HOMO SAPIENS]. | 534 | 4
NP_031400 | SINE OCULIS HOMEOBOX (DROSOPHILA) HOMOLOG 6 [HOMO SAPIENS]. | 246 | 4
NP_004087 | EUKARYOTIC TRANSLATION INITIATION FACTOR 4E BINDING PROTEIN 2 [HOMO | 120 | 4
NP_005057 | SPLICING FACTOR PROLINE/GLUTAMINE RICH (POLYPYRIMIDINE | 707 | 4
NP_004266 | TRF-PROXIMAL PROTEIN [HOMO SAPIENS]. | 209 | 4
NP_004900 | TAXOL RESISTANCE ASSOCIATED GENE 3 [HOMO SAPIENS]. | 110 | 4
NP_004731 | TGF-BETA-1-INDUCED ANTIAPOPTOTIC FACTOR 1 [HOMO SAPIENS]. | 115 | 4
NP_004810 | SYMPLEKIN [HOMO SAPIENS]. | 1142 | 4
NP_004800 | STOMATIN-LIKE PROTEIN 1 [HOMO SAPIENS]. | 394 | 4
FVD project (version 1) :
Human reference protein sequences: 9848.
total: 5,046,910 residues.
dbEST dataset: 2,208,221.
Predicted non-synonymous cSNP: 55,433.
average cSNP per protein: 5.62.
average length per protein: 514.48 a.a.
Variant EST = 1: 40,215.
Variant ESTs >1: 15,218.
dbSNP match: 1,074
268 (synonymous)
838 (non-synonymous)
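The FVD summary figures can be cross-checked with a few divisions. The sketch below only recomputes the averages from the totals quoted above, and the constant and function names are illustrative. The recomputed values come out at about 5.63 cSNPs and 512.48 residues per protein, close to but not identical with the 5.62 and 514.48 on the slide, so the slide values may have been truncated or computed from a slightly different total.

```python
PROTEINS = 9_848            # human reference protein sequences
RESIDUES = 5_046_910        # total residues
CSNPS = 55_433              # predicted non-synonymous cSNPs
SINGLE_EST = 40_215         # cSNPs supported by one variant EST
MULTI_EST = 15_218          # cSNPs supported by more than one variant EST


def csnp_per_protein() -> float:
    return CSNPS / PROTEINS


def mean_protein_length() -> float:
    return RESIDUES / PROTEINS


def support_fractions() -> tuple[float, float]:
    """Share of predicted cSNPs backed by one EST versus several."""
    return SINGLE_EST / CSNPS, MULTI_EST / CSNPS


print(round(csnp_per_protein(), 2))      # 5.63
print(round(mean_protein_length(), 2))   # 512.48
print(SINGLE_EST + MULTI_EST == CSNPS)   # True
```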
NP_006323, IFI30
Potential error residues in reference proteins:
4,432 (0:>1)
Cancer Fetal Adult
* 0 0 32,357
0 * 0 11,441
0 0 * 18,859
PROTEIN PHOSPHATASE 1; CATALYTIC SUBUNIT