Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA
![Page 1: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/1.jpg)
Chinnappa Kodira
April 2004 GMOD 2004, Cambridge, MA
Manual Annotation of Human Genome at Broad Institute
![Page 2: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/2.jpg)
Goals
Accurate and comprehensive catalog of genes and gene products
Robust annotation system for annotation of all sequenced genomes
![Page 3: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/3.jpg)
Annotation Strategy: Evidence-based Annotation
CSMD1 gene:
Gene Size: 2,065,608 bases
Transcript Length: 11,297 bases
Protein Length: 3565 aa
No. of Exons: 68
Average length of Exons: 166 bases
Fgenesh 20
Genscan 25
Blat_EST 179
mRNA 3
![Page 4: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/4.jpg)
Rule-based Annotation
FL-mRNA
Species-specific ESTs
Cross-species ESTs
Protein homology
Ecores + GenePredictions
Decreasing order of confidence level
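The ordering on this slide can be sketched as a simple ranking lookup. This is only an illustration: the list order comes from the slide, while the names `EVIDENCE_RANK` and `best_evidence` are hypothetical and not part of the Broad pipeline.

```python
# Evidence types from the slide, highest confidence first.
EVIDENCE_RANK = [
    "FL-mRNA",
    "Species-specific EST",
    "Cross-species EST",
    "Protein homology",
    "Ecores + gene prediction",
]
# Lower index means higher confidence.
RANK = {name: i for i, name in enumerate(EVIDENCE_RANK)}


def best_evidence(evidence_types):
    """Return the highest-confidence evidence type present, or None."""
    known = [e for e in evidence_types if e in RANK]
    return min(known, key=RANK.__getitem__) if known else None
```

For example, a locus supported by both protein homology and a cross-species EST would be ranked by the cross-species EST.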
![Page 5: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/5.jpg)
Annotation System
Automated GeneCaller
Publication
database
Loader
Genome Evidence
Transcript Hunter
Manual Annotation
Argo Genome Browser
Alignment
QA
![Page 6: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/6.jpg)
Critical Steps in our Annotation Process
Running Computes
Selection and Filtering Evidence
Intelligent Automated Gene Caller
Genome Browser and Editor
Annotation Rules
Trained Manual Annotators
Annotation QA Process
![Page 7: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/7.jpg)
Computes
Finished Sequence
Repeat Mask
Homology Search
Sequence Alignment
Gene Prediction
Computed Features
Filtering of High Quality Evidence
•Identity >95% and >50% QS coverage
•Splice Junctions
•Rank Order
•Repeat filtering
Annotation
Raw Features
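The identity and coverage thresholds above could be applied roughly as follows. This is a minimal sketch: reading "QS coverage" as query-sequence coverage is an assumption, and the `Alignment` record and `passes_filter` function are hypothetical names, not the pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    percent_identity: float    # 0 to 100
    query_length: int          # length of the aligned EST or mRNA
    aligned_query_bases: int   # query bases placed on the genome


def passes_filter(aln, min_identity=95.0, min_coverage=50.0):
    """Keep alignments with identity >95% and query coverage >50%."""
    coverage = 100.0 * aln.aligned_query_bases / aln.query_length
    return aln.percent_identity > min_identity and coverage > min_coverage
```

Both comparisons are strict, matching the ">" on the slide, so an alignment with exactly 50% coverage is rejected.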
![Page 8: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/8.jpg)
TranscriptHunter
Computed Features
Exon-based Clustering
•Define Gene Locus
Intron Edge Clustering
•Identify Variants
TranscriptHunter
Creation of Gene Models
•ORF and UTRs
•Gene Name
•Transcript Classification
•Curation Flags
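The two clustering steps can be illustrated with a small sketch. It assumes that transcripts on the same strand with any overlapping exon belong to one locus, and that transcripts sharing an identical intron chain are the same variant. TranscriptHunter's actual rules are not given on the slide, so every function name and rule here is hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """True when two inclusive (start, end) intervals share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_into_loci(transcripts):
    """transcripts: name -> (strand, [(exon_start, exon_end), ...])."""
    names = list(transcripts)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            strand_a, exons_a = transcripts[a]
            strand_b, exons_b = transcripts[b]
            if strand_a == strand_b and any(
                overlaps(x, y) for x in exons_a for y in exons_b
            ):
                parent[find(a)] = find(b)

    loci = defaultdict(list)
    for n in names:
        loci[find(n)].append(n)
    return sorted(sorted(members) for members in loci.values())


def intron_chain(exons):
    """Inclusive intron intervals implied by consecutive exons."""
    exons = sorted(exons)
    return tuple(
        (exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)
    )


def variants_in_locus(transcripts, members):
    """Group the members of one locus by identical intron edges."""
    groups = defaultdict(list)
    for n in members:
        groups[intron_chain(transcripts[n][1])].append(n)
    return sorted(sorted(g) for g in groups.values())
```

The pairwise comparison is quadratic in the number of transcripts; a production system would more likely sort features by coordinate and sweep, but the quadratic form keeps the idea visible.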
![Page 9: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/9.jpg)
Screening of spliced ESTs contained within repeat elements
AluYb8 Repeat
Spliced ESTs
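One way to picture this screen is a containment test. The sketch below assumes that an EST is discarded when its whole aligned span falls inside a single annotated repeat (such as the AluYb8 element in the figure). The exact criterion used at Broad is not stated, and the function names are hypothetical.

```python
def est_span(blocks):
    """Genomic span covered by an EST's aligned blocks (inclusive)."""
    return (min(s for s, _ in blocks), max(e for _, e in blocks))


def is_contained_in_repeat(blocks, repeats):
    """True when the whole EST span lies inside one repeat interval."""
    start, end = est_span(blocks)
    return any(r_start <= start and end <= r_end for r_start, r_end in repeats)


def screen_ests(ests, repeats):
    """Return the names of ESTs that survive the repeat screen."""
    return [
        name
        for name, blocks in ests.items()
        if not is_contained_in_repeat(blocks, repeats)
    ]
```

An EST that merely overlaps a repeat but extends beyond it is kept under this rule.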
![Page 10: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/10.jpg)
Manual annotation
TranscriptHunter Gene Models
•Refine Gene Boundaries
•Exon/Intron
•3’ and 5’ UTR
•Create New Genes
•Classify Transcripts
•Edit Automated Gene Calls
•Identify Pseudogenes
•Add Curation Flags
•Call/Adjust ORF
•Select PolyA Signals
AnnotDB
![Page 11: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/11.jpg)
Features of Argo
Attaching primary and supplemental evidence
Cluster feature display
Filtering and customizing evidence list
Display poly A signals and splice junctions
Alerting discrepancies before updating
Highlighting parent and child features
Real-time interactive analysis
ORF selection options
Tabular dump of selected features
Roll back and save work
Customization of feature display
![Page 12: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/12.jpg)
Annotation View
![Page 13: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/13.jpg)
Confidence levels of our gene models
Classification of transcripts (Hawk standards): Known, Novel_CDS, Novel, Putative, Pseudogene
Association of primary and supplemental evidence with annotated feature
Rank order in selection of supporting evidence
Curation flags
Free text comments
![Page 14: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/14.jpg)
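Gene counts for Broad and Ensembl

| chrom | genome (%) | Ensembl known | Ensembl novel | Broad known | Broad novel+putative | Spl count | pseudogene |
|---|---|---|---|---|---|---|---|
| 8 | 4.7 | 710 | 132 | 724 | 587 | 2.6 | 298 |
| 15 | 2.7 | 581 | 165 | 589 | 556 | 2.8 | 213 |
| 17 | 2.6 | 1120 | 167 | 1134 | 578 | 3.3 | 264 |
| 18 | 2.5 | 265 | 73 | 289 | 275 | 2.1 | 167 |
| TOTAL | 12.5 | 2676 | 537 | 2736 | 1996 | | 942 |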
![Page 15: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/15.jpg)
Manually Annotated Gene Models vs. public Gene Models
Broad
MGC
Refseq
ENSEMBL
Gene-wise
mRNA
![Page 16: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/16.jpg)
Types of splice variation
Type % of variants
extra 31
skip 18
alt site 33
run on 18
CDS altered 84 %
new stop 48 %
![Page 17: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/17.jpg)
Our data extend most RefSeq/MGC transcripts
Distribution of extensions relative to RefSeq or MGC evidence (human chromosomes 8, 15, 17, 18)
[Figure: % of extensions (0 to 80) plotted against length of extension (100 to 1000 bp), with separate series for 5' and 3' extensions]
38% positive for 5' extension
71% positive for 3' extension
30% positive for both
79% positive for either
Median 5' extension = 46 bases
Median 3' extension = 143 bases
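The extension lengths summarized above could be computed along these lines. This sketch compares genomic spans only and ignores introns, so it is an approximation; the function name and the coordinates in the usage note are hypothetical. The final assertion checks that the four percentages on the slide are mutually consistent (either = 5' + 3' - both).

```python
from statistics import median


def extensions(annotated, reference, strand):
    """Return (five_prime, three_prime) extension lengths in bases.

    annotated and reference are (start, end) genomic spans with
    start <= end. A value of 0 means the annotation does not extend
    the reference at that end.
    """
    left = max(0, reference[0] - annotated[0])
    right = max(0, annotated[1] - reference[1])
    # On the minus strand the 5' end is the rightmost coordinate.
    return (left, right) if strand == "+" else (right, left)


def median_extension(lengths):
    """Median over the positive extensions only; None if there are none."""
    positive = [x for x in lengths if x > 0]
    return median(positive) if positive else None


# Inclusion-exclusion check on the slide's percentages.
assert 38 + 71 - 30 == 79
```

For instance, `extensions((954, 2143), (1000, 2000), "+")` gives `(46, 143)`, and the same spans on the minus strand give `(143, 46)`.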
![Page 18: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/18.jpg)
Complete 3' end as compared to RefSeq mRNA and ENSEMBL gene
![Page 19: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/19.jpg)
How valid are these 3’ and 5’ extensions?
[Figure: bar chart comparing Broad and ENSEMBL, y-axis 0 to 1]

| | PolyA signals | 5 ^ATG…STOP$ |
|---|---|---|
| Broad | 86% | 1.16% |
| ENSEMBL | 68% | 10.89% |
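Both measures can be illustrated with simple sequence checks. The sketch rests on two assumptions that do not come from the slide: a polyA signal is taken to be one of the common hexamers AATAAA or ATTAAA found within the last 50 bases of the transcript, and "^ATG…STOP$" is read as a transcript that begins exactly at ATG and ends exactly at a stop codon, that is, a model with no UTRs. The function names and the window size are hypothetical.

```python
import re

# Canonical polyadenylation hexamer and its most common variant.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def has_polya_signal(transcript, window=50):
    """True when a polyA hexamer occurs in the last `window` bases."""
    tail = transcript.upper()[-window:]
    return any(signal in tail for signal in POLYA_SIGNALS)


def lacks_utrs(transcript):
    """True when the sequence starts with ATG and ends with a stop codon."""
    return re.fullmatch(r"ATG[ACGTN]*(TAA|TAG|TGA)", transcript.upper()) is not None
```

Under this reading, a high polyA-signal rate together with a low rate of UTR-less models suggests that the annotated ends reach real transcript boundaries.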
![Page 20: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/20.jpg)
Using Start and Stop Codon Context to Refine Annotation
[Figure: Location of stop codons on exons; x-axis: exon order (n, n-1, n-2, n-3, n-4, n-5); y-axis: % stop codons (0 to 100)]
[Figure: Location of start codons on exons; x-axis: exon order (1 to 6); y-axis: % start codons (0 to 70)]
•Pseudogenes
•Real Stop codons
•NMD candidates
•Sequence Errors
•Non-coding genes
•SECIS genes

•Pseudogenes
•Real Start codons
•NMD candidates
•Sequence Errors
•Non-coding genes
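The exon-order labels in the stop-codon plot (n for the last exon, n-1 for the one before it) can be derived from exon lengths and the stop codon's position in the spliced transcript. The sketch also flags NMD candidates using the widely cited heuristic that a stop codon more than about 50 nt upstream of the last exon-exon junction may trigger nonsense-mediated decay. That threshold is an assumption and does not come from the slide, and all names are hypothetical.

```python
def exon_index_of(position, exon_lengths):
    """0-based index of the exon containing a 0-based transcript position."""
    offset = 0
    for i, length in enumerate(exon_lengths):
        if position < offset + length:
            return i
        offset += length
    raise ValueError("position lies beyond the transcript")


def stop_exon_label(stop_pos, exon_lengths):
    """Label the exon holding the stop codon as n, n-1, n-2, ..."""
    back = len(exon_lengths) - 1 - exon_index_of(stop_pos, exon_lengths)
    return "n" if back == 0 else f"n-{back}"


def is_nmd_candidate(stop_pos, exon_lengths, min_distance=50):
    """stop_pos is the 0-based position of the stop codon's first base."""
    if len(exon_lengths) < 2:
        return False
    last_junction = sum(exon_lengths[:-1])  # first base of the last exon
    # Bases between the end of the stop codon and the last junction.
    return last_junction - (stop_pos + 3) > min_distance
```

Models whose stop codon falls in n-1 or earlier are the ones an annotator would want to review against the categories listed on the slide.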
![Page 21: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/21.jpg)
Issues with Novel and putative transcripts
Concerns
•High number
•Low depth EST coverage
•Small transcript size
•Low no. of variants
•Poor coding potential
•Poor cross-species conservation
•Low poly A frequency
•Weak CpG context
Probable reasons
•Spurious transcription
•Mostly partial
•Temporal genes
•Non-coding
•Poorly expressed
•Lineage specific
![Page 22: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/22.jpg)
[Figure: Putative, Novel, and Known transcripts]
![Page 23: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/23.jpg)
Annotating Non-coding mRNAs is still a challenge!
snoRNAs
![Page 24: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/24.jpg)
Challenges Ahead….
Establishing Common Standards
Validating Novel Transcripts
Single Exon Expressed Sequences
Determination of Accurate ORFs
Annotation of Functionally Relevant Alternative Splice Forms
Finding Sparsely Expressed Genes
Annotation of New Types of Non-coding Functional mRNAs
Incremental Update of Annotation
Capturing Biological Exceptions
![Page 25: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/25.jpg)
Acknowledgements
•Reinhard Engels
•Shunguang Wang
•Seth Purcell
•Tim Elkins
•Yuhong Wu
•Serge Smirnov
•Sarah Calvo
•David Dicaprio
Annotation and Analysis
•Charlie Whittaker
•Mark Borowsky
•Sinead O’leary
•James Galagan
•Jill Mesirov
•Eric Lander
•Sequencing, Finishing and Closure Teams
Annotation Pipeline
![Page 26: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/26.jpg)
Comparison of alternative splice forms between ENSEMBL and Broad annotation
Broad
ENSEMBL
Refseq
dbEST
nrnt-mRNA
Manually Annotated Gene Models vs. public Gene Models
![Page 27: Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA](https://reader036.fdocuments.net/reader036/viewer/2022062304/56814511550346895db1d3af/html5/thumbnails/27.jpg)
ENSEMBL
GENEWISE
REFSEQ
Transcript Hunter
MANUAL ANNOTATION
ESTs
PolyA signal
Novel Transcript Variants of Known Genes