Solanaceae 2006 BAC Annotation
2006. 07. 26
Plant Genome Research Center
KRIBB, KOREA
Development Environments
• OS: SGI IRIX 6.5
– CPU: MIPS 500MHz, 12 CPUs
– MEM: 12288 MB
• OS: SUSE Linux 9.0, version 2.6.11.4-21.11-bigsmp
– CPU: Intel(R) Xeon(TM) CPU 2.80GHz
– MEM: 6231 MB
• DBMS: MySQL-4.0.25
• Language: PHP 5.0.4, Apache 2.0.54, Perl-5.8.7
Data Sets
• BACs (SGN test BACs)
– Annotated: 10
• ESTs: 200,015 (cf. 202,043 current)
• Full-length mRNAs (GenBank): 596
• Protein DB (UniProt Release 7.7)
– Swiss-Prot / TrEMBL: 228,917 / 2,914,826
– Swiss-Prot / TrEMBL (plant): 15,203 / 219,361
• Arabidopsis Proteins
– Proteins, Genomes (TAIR): 30,693
– GO associated (TAIR): 28,812
– Pathway/EC associated (KEGG): 1,521
• Tomato Chip DATA
– Tomato Expression Database (Cornell)
Structural Annotation
Target / Analysis / Tools and Data (SGN Guideline vs. KRIBB)

Protein Coding Genes
• Computational Gene Prediction
– SGN Guideline: GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, Eugene
– KRIBB: FGENESH (N. tabacum), GENSCAN
• Experimental Gene Identification
– SGN Guideline: GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes)
– KRIBB: BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise2.0 (GenPept Proteins)
• Resolution of Conflict
– SGN Guideline: PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual)
– KRIBB: Combined Modeller (Automatic); Apollo Genome Viewer (Manual)

tRNA
• Computational tRNA Prediction
– SGN Guideline: tRNAscan-SE
– KRIBB: tRNAscan-SE

Other RNAs
• Similarity-based RNA Identification (microRNAs, snoRNAs)
– SGN Guideline: -
– KRIBB: Cross-match (GenBank rRNA, Rfam)

Promoter
• TFBS/Promoter analysis
– SGN Guideline: -
– KRIBB: Transfac, MEME, Gibbs, Pratt

Repeats
• Repeat Scanning
– SGN Guideline: -
– KRIBB: RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats)
Functional Annotation
Target / Analysis / Tools and Data (SGN Guideline vs. KRIBB)

Function of Protein Coding Genes
• Conserved Functional Domains
– SGN Guideline: InterProScan (InterPro Databases)
– KRIBB: InterProScan (InterPro Databases)
• Homology to Proteins
– SGN Guideline: BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr)
– KRIBB: BLASTx, WU-BLAST-2.0 (Swiss-Prot, TrEMBL, Arabidopsis)
• Gene Ontology assignment
– SGN Guideline: -
– KRIBB: BLASTx (Arabidopsis proteins associated with GOA, TAIR GO data)
• EC/Pathway
– SGN Guideline: -
– KRIBB: BLASTx (Arabidopsis proteins associated with KEGG EC/Pathway data)
• TFBS / Promoter
– KRIBB: WU-BLAST2 (blastx) against Arabidopsis proteins associated with TFBS/Promoter
• Protein Location Predictions
– SGN Guideline: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
– KRIBB: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
Defining gene structure from multiple lines of evidence
• Full-length evidenced genes (mRNAs / Proteins)
• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)
• Partially evidenced genes (Other partial ESTs)
• No-evidenced genes (Prediction only)
[Figure: schematic relating predicted models (Predict) to mRNA, Protein, and EST evidence]
1) Full-length Evidenced Genes
• Gene locus with full-length mRNA / protein (GMAP, GeneWise)
• Almost complete gene structure: gene boundary (mRNA: TSS/poly-A, protein: CDS), exon/intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or protein
• Processing:
– Merge the same AS forms
– mRNA evidence: predict CDS (ESTscan etc.)
– Protein evidence: mend gene boundary (TSS, poly-A)
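The "merge the same AS forms" step can be illustrated by grouping aligned transcripts on their intron chain. This is only a sketch under stated assumptions: the exon-list representation and the names `intron_chain` and `merge_same_forms` are hypothetical, all alignments are assumed to belong to one gene locus, and treating transcripts with identical intron chains as the same form is an assumption rather than the pipeline's documented rule.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exon coordinates."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def merge_same_forms(alignments):
    """Group alignments that share an intron chain and merge their outer ends.

    alignments: dict mapping a transcript id to a list of (start, end) exons,
    all assumed to come from a single gene locus (hypothetical input format).
    Returns a list of (merged_exons, sorted_member_ids), one per splice form.
    """
    groups = defaultdict(list)
    for name, exons in alignments.items():
        groups[intron_chain(exons)].append(name)

    merged = []
    for chain, names in groups.items():
        # Outer boundaries come from the widest member of the group.
        start = min(min(s for s, _ in alignments[n]) for n in names)
        end = max(max(e for _, e in alignments[n]) for n in names)
        # Inner boundaries are fixed by the shared intron chain.
        bounds = [start] + [c for intron in chain for c in intron] + [end]
        exons = [(bounds[i], bounds[i + 1]) for i in range(0, len(bounds), 2)]
        merged.append((exons, sorted(names)))
    return merged
```

With three two-exon alignments of which two share the same intron, the function returns two forms: one merged model spanning the widest boundaries of the pair, and one model for the alternative form.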
[Figure: schematic (mRNA, Protein, Predict) and sample view with mRNAs, TIGR TC, stackPACK, ESTs, and Predicted Genes]
2) Full-length Clue Evidenced Genes
• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some exon/intron
• Requirement: more than 1 full-length clue EST
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)
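For EST-derived models the CDS still has to be predicted. The slides name ESTscan and orfPredictor for this; the sketch below does not reproduce either tool. It swaps in a plain longest-ORF scan on the forward strand, purely to show what "CDS to be predicted" means in practice. The function name `longest_orf` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Returns None when no complete ORF exists. This is a naive stand-in for
    a real CDS predictor, not an implementation of ESTscan or orfPredictor.
    """
    seq = seq.upper()
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon to the first in-frame stop.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    return best
```

A real pipeline would also scan the reverse strand and tolerate frameshifts from sequencing errors, which is the main reason dedicated tools are used for ESTs.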
[Figure: schematic (EST, Predict) and sample view with Full-length Clue ESTs (Kazusa), Predicted Genes, and ESTs]
3) Partially Evidenced Genes
• Gene locus with general ESTs (GMAP)
• Some exon/intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)
[Figure: schematic (EST1, EST2, Predict) and sample view with ESTs and Predicted Genes]
4) No-evidenced Genes
• Predicted model only (hypothetical gene)
• Predicted CDS
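The four evidence classes form a priority order, which can be sketched as a small decision function. The function name, argument names, and default thresholds are hypothetical. In particular, the defaults read each "more than N" requirement on the slides as "at least N", which is an assumption; pass different thresholds if the stricter reading is intended.

```python
def classify_gene(n_full_length, n_clue_ests, n_ests, n_hard_edge_pairs,
                  min_full_length=1, min_clue_ests=1,
                  min_ests=2, min_edge_pairs=2):
    """Assign a gene locus to one of the four evidence classes.

    n_full_length:     full-length mRNAs or proteins aligned to the locus
    n_clue_ests:       full-length clue ESTs (Kazusa library) at the locus
    n_ests:            other ESTs aligned to the locus
    n_hard_edge_pairs: pairs of overlapping hard edges shared by those ESTs
    The min_* defaults are an assumed reading of the slide thresholds.
    """
    if n_full_length >= min_full_length:
        return "full-length evidenced"
    if n_clue_ests >= min_clue_ests:
        return "full-length clue evidenced"
    if n_ests >= min_ests and n_hard_edge_pairs >= min_edge_pairs:
        return "partially evidenced"
    return "no-evidenced"
```

Checking the strongest evidence first means a locus with both a full-length mRNA and partial ESTs is reported once, under the most reliable class.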
[Figure: sample view of a predicted gene with no supporting evidence]
Gene Structure Annotation - Problems
• False positive intergenic region: two annotated genes actually correspond to a single gene
• False negative intergenic region: one annotated gene structure actually contains two genes
• False negative gene prediction: missing gene (no annotation)
• Other: partially incorrect gene annotation; missing annotation of alternative transcripts (alternative splicing)
• Pseudogenes
• Promoter / regulatory elements
Estimated Gene Prediction
CATEGORY                                  NUMBER
Predicted Genes                              301
  TSS                                        294
  Start Codon                                296
  Stop Codon                                 297
  PAS signals 1)                             100
  PolyA (≥ 7)                                296
Genes overlapping EST Clusters               148
  Genes hitting multiple EST Clusters         61
  Genes hitting single EST Clusters           87
Genes overlapping ESTs                       165
  EST mapping Genes (≥ 2)                    109
  EST mapping Genes (= 1)                     56
Genes hitting mRNAs                            6
Genes hitting Full-length cDNAs               20
1) Hexamer signal A(A/U)UAAA: PASes (predicted polyadenylation signal hexamers)
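Scanning for the polyadenylation hexamer amounts to a pattern search. The sketch below assumes the footnote refers to the two canonical hexamers AAUAAA and AUUAAA (written with T for genomic DNA); that reading, and the function name `find_pas`, are assumptions and not taken from the slides.

```python
import re

# Lookahead so that overlapping occurrences are all reported.
# Assumed pattern: AATAAA or ATTAAA on the forward strand.
PAS_PATTERN = re.compile(r"(?=(A[AT]TAAA))")


def find_pas(dna):
    """Return (position, hexamer) for each candidate signal, 0-based."""
    return [(m.start(), m.group(1)) for m in PAS_PATTERN.finditer(dna.upper())]
```

A hit is only a candidate: the table above reports PAS signals for 100 of 301 predicted genes, so position relative to the predicted 3' end would normally be checked as well.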
Gene Structure Browser
• Test BLAT/SIM4/GMAP/GeneSeqer
– BLAT: fast / inaccurate
– SIM4/GMAP/GeneSeqer: approximately the same results
• KRIBB: prefiltering ESTs by BLAT + GMAP
• Cutoff: coverage > 80%, identity > 90%
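The coverage and identity cutoff can be applied as a simple filter over alignment records. This is a minimal sketch: the `Alignment` record and its field names are hypothetical and do not correspond to BLAT or GMAP output columns, and the definitions of coverage (aligned bases over EST length) and identity (matches over aligned bases) are the usual ones, assumed here.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one EST-to-BAC alignment."""
    est_id: str
    est_length: int      # total EST length in bases
    aligned_bases: int   # EST bases included in the alignment
    matches: int         # aligned bases identical to the genome

    @property
    def coverage(self):
        return 100.0 * self.aligned_bases / self.est_length

    @property
    def identity(self):
        if self.aligned_bases == 0:
            return 0.0
        return 100.0 * self.matches / self.aligned_bases


def prefilter(alignments, min_coverage=80.0, min_identity=90.0):
    """Keep alignments with coverage > 80% and identity > 90% by default."""
    return [a for a in alignments
            if a.coverage > min_coverage and a.identity > min_identity]
```

Both comparisons are strict, matching the ">" in the stated cutoff, so an alignment at exactly 80% coverage is dropped.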
[Figure: gene structure browser showing dbESTs, TIGR TC, Unigene, Kazusa Full ESTs, Protein, FGENESH, GENSCAN, mRNA, and Repeats / Domain; clickable features]
Functional Annotation
[Figures: browser views for Protein DB / EC / GO, TFBS / Promoter, Protein DB / GO, TargetP/TMHMM, Enzyme / Pathway, and Domain / Motif]
Expression Annotation (Digital Expression)
Principle of identifying differentially expressed genes by Hypergeometric Test
• N: ESTs for all genes in all tissues
• n: ESTs for selected gene in all tissues
• K: ESTs for all genes in selected tissue
• k: ESTs for selected gene in selected tissue
• P: significance of over- or under-expression in selected tissue
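With N, n, K, and k defined as above, the significance P can be computed from the hypergeometric distribution. The sketch below uses only the standard library; the function names are hypothetical, and reporting the upper tail for over-expression and the lower tail for under-expression is an assumed convention, since the slides do not state which tail is used.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k of the gene's n ESTs fall in the selected tissue)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def over_expression_p(k, N, K, n):
    """Upper tail P(X >= k): chance of seeing k or more ESTs in the tissue."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


def under_expression_p(k, N, K, n):
    """Lower tail P(X <= k): chance of seeing k or fewer ESTs in the tissue."""
    lower = max(0, n - (N - K))
    return sum(hypergeom_pmf(i, N, K, n) for i in range(lower, k + 1))
```

For example, with N = 10, K = 4, n = 3, and k = 2, the over-expression probability is 1/3. A small P in the upper tail suggests the gene is over-represented in the selected tissue.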
Expression Annotation (ARRAY CHIP)
Expression Annotation (Tissue Specific Genes)
Principle of identifying differentially expressed genes by Audic's Test
• x: number of cognate ESTs of a given gene in a selected library
• N1: selected library (total number of ESTs)
• y: number of cognate ESTs of a given gene in other library
• N2: other library (total number of ESTs)
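The Audic and Claverie (1997) test gives the probability of observing y ESTs in the other library given x in the selected one: p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)). The sketch below evaluates it in log space to avoid overflow. That the slides used exactly this form, and the function names, are assumptions.

```python
from math import exp, lgamma, log


def audic_claverie_pmf(y, x, n1, n2):
    """p(y | x) for library sizes n1 (selected) and n2 (other)."""
    r = n2 / n1
    log_p = (y * log(r)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1.0 + r))
    return exp(log_p)


def upper_tail(y, x, n1, n2):
    """P(Y >= y | x): probability of y or more ESTs in the other library."""
    below = sum(audic_claverie_pmf(i, x, n1, n2) for i in range(y))
    return max(0.0, 1.0 - below)
```

When the two libraries are the same size and x = 0, p(0|0) is 0.5, which is a quick sanity check on the formula.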
Pepper tissue-specific gene analysis
[Figure: expression of CaActin, CacnA (16), CacnB (18), CacnC (13), CacnD (10), CacnE (25), CacnF (31), and CacnG (20) across samples: Leaf, stem, root, Buf, Xag, IM, M.G, Breaker, M.R, Floral bud, Flower, Bark; sample groups Flower, Pathogen, Fruit]
* 25 cycles, annealing temp. 55℃
* (# of ESTs)
Annotation Results
Property                       Value   Unit
BAC (Annotated)                   10   BAC
Length (Average)                 120   kb
Putative Protein CDSs            301   gene
Gene Density                     4.2   kb/gene
Gene Length, Average             3.1   kb
Exon Length, Average             338   bp
Exons per Gene, Average          8.4   exon/gene
With ESTs                        165   gene
Protein Annotated                196   gene
Domain Annotated                 213   gene
GO Annotated                     144   gene
Pathway Annotated                 17   gene
EC Annotated                      17   gene
TFBS/Promoter Annotated          127   gene
Tissue specific Annotated         56   gene
Expression Annotated              18   gene
tRNA                               0   gene
Repeats                          144   kb
Thanks !!
Solanaceae 2006 BAC Annotation Test page
http://crop.kribb.re.kr/SOL-Test/
http://sol.kribb.re.kr/