Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation, Barcelona, Spain
04/18/23 INB Roadshow - Pamplona
Node 1 of the INB
GN1 Bioinformática y Genómica Genome Bioinformatic Lab, CRG
Roderic Guigó (PI)
Themes
Gene prediction
    ab initio => GeneID
    dual-genome => SGP2
    U12 introns => GeneID v1.3 and U12DB
    combiner => GenePC
Genome feature visualization
    gff2ps
Alternative splicing
    ASTALAVISTA
Gene expression regulatory elements
    meta and mmeta alignment
Eukaryotic gene structure
[Figure: eukaryotic gene structure, labelling the exons, introns, upstream regulator, downstream regulator, promoter, and the acceptor and donor splice sites]
The Splicing Code
Gene Prediction Strategies
Expressed sequence (cDNA) or protein sequence available?
    Yes: Spliced alignment
        BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
    No: Integrated gene prediction
        Informant genome(s) available?
            Yes: Dual or n-genome de novo predictors
                SGP2, Twinscan, NSCAN, (Genomescan: same or cross genome protein blastx)
            No: ab initio predictors
                geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
Frameworks for gene prediction
Hierarchical exon-building and chaining
Hidden Markov Models (many flavors)
    HMM, GHMM, GPHMM, Phylo-HMM
Conditional Random Fields (new!)
    Conrad, Contrast...
and, no doubt, more to come
All of them find the optimal path through the candidate exons using dynamic programming (e.g. the GenAmic and Viterbi algorithms)
How does GeneID approach gene prediction?
The gene prediction problem
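[Figure: three levels of the problem. Sites: candidate acceptors a1-a4 and donors d1-d5. Exons: candidate exons e1-e8 built from those sites. Genes: a compatible chain of exons, here e1, e4, e8.]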
GeneID
Geneid follows a hierarchical structure: signal => exon => gene
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm: maximize the score of the assembled exons => assembled gene
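The assembly step can be illustrated with a small dynamic programming sketch. This is not geneid's own algorithm: it is a simplified chaining routine that only requires exons not to overlap, and it ignores reading frame and exon-type compatibility, which a real gene predictor must check. The `Exon` type, the `best_chain` function and the toy coordinates and scores are all hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int
    end: int
    score: float


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    # best[i] holds (score, chain) for the best chain using only the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, exon in enumerate(ordered):
        # j = number of earlier exons that end strictly before this exon starts.
        j = bisect_left(ends, exon.start, 0, i)
        take_score = best[j][0] + exon.score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [exon]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]


# Toy candidate exons e1..e8 (made-up coordinates and scores).
candidates = [
    Exon(100, 200, 2.0), Exon(150, 260, 1.5), Exon(180, 300, 0.5),
    Exon(400, 500, 3.0), Exon(450, 520, 1.0),
    Exon(600, 700, -0.5), Exon(650, 720, -0.2),
    Exon(800, 900, 2.5),
]
total, chain = best_chain(candidates)
print(total, [(e.start, e.end) for e in chain])
```

With these toy values the chain keeps the first, fourth and last candidates and drops the overlapping or negatively scored ones, mirroring the e1, e4, e8 gene drawn on the previous slide.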
Training GeneID
Base frequencies at each of the 9 positions of the aligned donor sites listed below:
     1    2    3    4    5    6    7    8    9
A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6
Donor sites:
GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT

ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
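The frequency matrix on this slide can be reproduced by counting bases column by column in the ten aligned donor sites. The sketch below does that, and then scores a candidate site as a sum of log-likelihood ratios. The scoring part rests on assumptions that are not from the slides: a uniform background of 0.25 per base and a small pseudocount to avoid taking the log of zero. Function names are hypothetical and this is not geneid's training code.

```python
import math
from typing import Dict, List

# The ten aligned donor sites shown on the slide.
DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def position_frequencies(sites: List[str]) -> List[Dict[str, float]]:
    """Return one {base: frequency} dict per alignment column."""
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({base: column.count(base) / n for base in "ACGT"})
    return matrix


def log_odds_score(seq: str, matrix: List[Dict[str, float]],
                   background: float = 0.25, pseudo: float = 0.01) -> float:
    """Sum over positions of log2((frequency + pseudo) / background).

    The uniform background and the pseudocount are illustrative assumptions.
    """
    return sum(math.log2((matrix[i][base] + pseudo) / background)
               for i, base in enumerate(seq))


matrix = position_frequencies(DONOR_SITES)
for base in "ACGT":
    print(base, " ".join(f"{column[base]:.1f}" for column in matrix))
print(log_odds_score("AAGGTAAGT", matrix))
```

Positions 4 and 5 come out as G = 1.0 and T = 1.0, the invariant GT dinucleotide at the start of the intron.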
Running GeneID: command line or on the geneid server
NAME
    geneid - a program to annotate genomic sequences
SYNOPSIS
    geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m] [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord] [-O <gff_exons_file>] [-R <gff_annotation-file>] [-S <gff_homology_file>] [-P <parameter_file>] [-E exonweight] [-V evidence_exonweight] [-Bv] [-h] <locus_seq_in_fasta_format>
RELEASE
    geneid v 1.3
OPTIONS
    -b: Output Start codons
    -d: Output Donor splice sites
    -a: Output Acceptor splice sites
    -e: Output Stop codons
    -f: Output Initial exons
    -i: Output Internal exons
    -t: Output Terminal exons
    -n: Output introns
    -s: Output Single genes
    -x: Output all predicted exons
    -z: Output Open Reading Frames

    -D: Output genomic sequence of exons in predicted genes
    -A: Output amino acid sequence derived from predicted CDS

    -p: Prefix this value to the names of predicted genes, peptides and CDS

    -G: Use GFF format to print predictions
    -3: Use GFF3 format to print predictions
    -X: Use extended-format to print gene predictions
    -M: Use XML format to print gene predictions
    -m: Show DTD for XML-format output

    -j: Begin prediction at this coordinate
    -k: End prediction at this coordinate
    -W: Only Forward sense prediction (Watson)
    -C: Only Reverse sense prediction (Crick)
    -U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
    -r: Use recursive splicing
    -F: Force the prediction of one gene structure
    -o: Only running exon prediction (disable gene prediction)
    -O <exons_filename>: Only running gene prediction (not exon prediction)
    -Z: Activate Open Reading Frames searching

    -R <exons_filename>: Provide annotations to improve predictions
    -S <HSP_filename>: Using information from protein sequence alignments to improve predictions

    -E: Add this value to the exon weight parameter (see parameter file)
    -V: Add this value to the score of evidence exons
    -P <parameter_file>: Use other than default parameter file (human)

    -B: Display memory required to execute geneid given a sequence
    -v: Verbose. Display info messages
    -h: Show this help
AUTHORS
    geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo. Parameter files have been created by Genis Parra and Tyler Alioto. Any bug or suggestion can be reported to [email protected]
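As a minimal sketch of scripting the command line, the helper below assembles a geneid invocation from the flags listed in the help text (`-G`, `-W`, `-C`, `-j`, `-k`, `-P`). The helper `build_geneid_command` and the file names `HS307871.fa` and `my_species.param` are hypothetical. The block only builds and prints the command; running it would require a geneid binary on the PATH.

```python
import shlex
from typing import List, Optional


def build_geneid_command(fasta: str,
                         parameter_file: Optional[str] = None,
                         gff: bool = True,
                         start: Optional[int] = None,
                         end: Optional[int] = None,
                         strand: Optional[str] = None) -> List[str]:
    """Build an argument list for geneid using flags from its help text."""
    cmd = ["geneid"]
    if gff:
        cmd.append("-G")              # GFF output
    if strand == "+":
        cmd.append("-W")              # forward (Watson) strand only
    elif strand == "-":
        cmd.append("-C")              # reverse (Crick) strand only
    if start is not None:
        cmd += ["-j", str(start)]     # begin prediction at this coordinate
    if end is not None:
        cmd += ["-k", str(end)]       # end prediction at this coordinate
    if parameter_file is not None:
        cmd += ["-P", parameter_file]  # non-default parameter file
    cmd.append(fasta)                 # the sequence file comes last
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_geneid_command("HS307871.fa", parameter_file="my_species.param",
                           start=1000, end=4500, strand="+")
print(shlex.join(cmd))
```

To actually run the prediction, the list could be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` and the predictions read from the returned `stdout`.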
GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version: geneid v 1.2 -- [email protected]
# Sequence HS307871 - Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score = 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
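Output like this is easy to post-process. The sketch below parses the nine exon records shown above and checks them against the gene header: nine exons whose combined length corresponds to 391 codons. The `Feature` type and `parse_gff` function are hypothetical names, and the parser splits on any whitespace, an assumption made because the records are shown here with spaces rather than tabs.

```python
from typing import List, NamedTuple

GENEID_OUTPUT = """\
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
"""


class Feature(NamedTuple):
    seq: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    strand: str
    frame: str
    group: str


def parse_gff(text: str) -> List[Feature]:
    """Parse GFF records, skipping blank lines and '#' comment lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split into the 9 GFF fields; the group field keeps any inner spaces.
        seq, source, feature, start, end, score, strand, frame, group = line.split(None, 8)
        features.append(Feature(seq, source, feature, int(start), int(end),
                                float(score), strand, frame, group))
    return features


exons = parse_gff(GENEID_OUTPUT)
coding_length = sum(f.end - f.start + 1 for f in exons)  # GFF coordinates are inclusive
print(len(exons), "exons,", coding_length, "bp,", coding_length // 3, "codons")
```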
GFF: a standard annotation format
Stands for: Gene Finding Format -or- General Feature Format
Designed as a single line record for describing features on DNA sequence; originally used for gene prediction output
9 tab-delimited fields common to all versions:
    seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
    GFF2: group is a unique description, usually the gene name.
        NCOA1
    GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced, start_codon and stop_codon are required features for CDS
        transcript_id "NM_056789" ; gene_id "NCOA1"
    GFF3: Capitalized tags follow Sequence Ontology (SO) relationships, FASTA seqs can be embedded
        ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
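The difference in the ninth field can be made concrete with two small parsers, one for GTF-style tag-value pairs and one for GFF3-style tag=value pairs. This is a deliberately naive sketch with hypothetical function names: it splits on every semicolon, so it would mishandle a semicolon inside a quoted GTF value, and it does not decode the URL-style escapes that GFF3 allows.

```python
from typing import Dict


def parse_gtf_group(group: str) -> Dict[str, str]:
    """Parse 'tag "value"; tag "value"' pairs (GTF style)."""
    attrs = {}
    for part in group.split(";"):
        part = part.strip()
        if not part:
            continue
        tag, value = part.split(None, 1)   # tag and value are separated by whitespace
        attrs[tag] = value.strip().strip('"')
    return attrs


def parse_gff3_group(group: str) -> Dict[str, str]:
    """Parse 'tag=value;tag=value' pairs (GFF3 style)."""
    attrs = {}
    for part in group.split(";"):
        part = part.strip()
        if not part:
            continue
        tag, value = part.split("=", 1)    # tag and value are separated by '='
        attrs[tag] = value
    return attrs


print(parse_gtf_group('transcript_id "NM_056789" ; gene_id "NCOA1"'))
print(parse_gff3_group("ID=NM_056789_exon1; Parent=NM_056789"))
```

In GFF2 no parsing is needed: the whole field (here `NCOA1`) is the group name.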
Visualizing features with gff2ps
generated by Josep Abril
Visualizing features on UCSC genome browser (custom tracks)
If "your" genome is served by UCSC, this is a good option because:
    browsing is dynamic
    you have access to other annotations
    you can view the DNA sequence
    you can do complex intersections and filtering
gff2ps is good when:
    your genome is not on UCSC
    you want more flexible layout options
    you want to run it 'offline'
Extensions to GeneID
Syntenic Gene Prediction (dual-genome)
Evidence-based (constrained) gene prediction
U12 intron detection
Combining gene predictions
Selenoprotein gene prediction
Syntenic Gene Prediction: SGP2
Minor splicing and U12 introns
U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
But they can be found in 2-3% of genes
Normally ignored, but this causes annotation problems
Easy to predict due to highly conserved donor and branch sites
Splice Signal Profiles: major and minor
Gathering U12 Introns
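[Figure: U12DB construction pipeline. For the human and fruit fly genomes, all annotated introns are scored and U12 introns are predicted; predictions are aligned to EST/mRNA and merged with published introns; the set is then extended by ortholog search (17 species) plus spliced alignment. Labels in the figure also include "ENSEMBL?" and the counts 563568385, 658, 2084 and 597.]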
Coming Soon: GenePC, a Gene Prediction Combiner
Tutorial Homepage
http://genome.imim.es/courses/Pamplona07/
GBL Homepage
http://genome.imim.es/