Gene-omes built from mRNA-seq not genome...

1
Fig.1 Gene-ome Completeness of Animals & Plants 0 200 400 600 800 1000 0 200 400 600 800 1000 Version Improvement for Gene Size x Homology Evigene vs TSA Top1K mRNA Assemblies Protein Size Improvement (ave. nAA) Homology Score Improvement (ave. nAA) banana1bg banana1bt catfish drosmel53 killifish locust1bt pogonus_beetle tiger_shrimp whitefly white_shrimp slope= 0.72 Rsq= 0.934 p= 1e−05 Fig.2 mRNA Assembly Improvements, Protein Size x Homology Alignment Fig.3 Alignment of mRNA As- semblies to Reference Genes for longest 200 ref genes. (RED = this, BLUE = max) Human12 Human05 Zebrafish11 Zebrafish05 Killifish12/evg Catfish13/evg Catfish12/tsa Stickleback12 Arabidopsis10 Arabidopsis05 Cacao12/evg Populus11 Banana13/evg Banana12/tsa Drosophila11 Drosophila04 Drosophila02 Tribolium11 Pbeetle13/evg Pbeetle12/tsa Honeybee11 Honeybee05 Nasonia12/evg Nasonia09 Aphid11/evg Aphid10 Locust13/evg Locust12 Whitefly13/evg Whitefly12/tsa Daphniap10/evg Daphniap06 Daphniam13/evg Daphniam11/evg Wshrimp13/evg Wshrimp12/tsa Tshrimp13/evg Tshrimp12/tsa Ztick13/evg Ztick13/tsa Ixodes11 Tetur11 Species Proteome CDS size (1000 longest ave.) 2000 4000 6000 8000 100k 98k 98k 22k 86k 40k 39k 78k 16k 16k 16k 16k 15k 6k 69k 69k 54k 63k 56k 17k 70k 54k 50k 24k 62k 18k43k 18k 37k 6k 23k 23k 67k 32k 26k 8k 26k 8k 24k 20k 14k 55k evgD evgR evgD evgR evgR evgD evgD evgR evgR evgD evgR evgD evgR evgR evgR H H Z Z K C C S A A C P B B D D D T P P H H N N A A L L WW D D D DWW T T Z Z I T evgD = evigene gDNA evgR = evigene mRNA 100k = longest CDS in kb Crustacea Insects Plants Ticks Vertebrates Clade-Best Evigene mRNA Clade-Good Evigene mRNA 0 50 100 150 200 0 20 40 60 80 RefHuDaTr_median−ticks: ixodes (55% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHuDaTr_median−ticks: tetur (64% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHuDaTr_median−ticks: ztick_evg (70% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHuDaTr_median−ticks: ztick_tsa (55% ref) Longest Ref Genes Protein alignment % Ticks Same RNA Ixodes tick Spider mite Zebra tick TSA 0 50 100 150 200 0 20 40 60 80 RefHoneybee−insects: pgbeetl_evg (73% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHoneybee−insects: pgbeetl_tsa (54% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHoneybee−insects: pinebeetle (54% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefHoneybee−insects: trbeetle (67% ref) Longest Ref Genes Protein alignment % Beetles Same RNA Pogonus TSA Pine Tribolium Pogonus EvgR 0 50 100 150 200 0 20 40 60 80 RefBeetle−crusts: daphmag5_evg (67% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefBeetle−crusts: litova1_evg (49% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefBeetle−crusts: penaeus_evg (65% ref) Longest Ref Genes Protein alignment % 0 50 100 150 200 0 20 40 60 80 RefBeetle−crusts: penaeus_tsa (14% ref) Longest Ref Genes Protein alignment % Crustacea Same RNA Daphnia m EvgR White shrimp EvgR Tiger shrimp TSA Tiger shrimp EvgR Zebra tick EvgR EvidentialGene mRNA-assembly pipeline Gene-ome construction with mRNA-seq How to build perfect mRNA transcript assemblies Does & Don’ts EvidentialGene http://arthropods.eugenes.org/EvidentialGene/ Genome collaborators and data providers: Daphnia Genome Consortium, Generic Model Organism Database, Indiana Univ. Center for Genomics & Bioinformatics, International Aphid Genomics Consortium, Nasonia jewel wasp Genome project, Cacao chocolate tree Genome project. Funding: NSF mostly, NIH, Mars. Computers: TeraGrid/XSEDE, NCGAS Gene-omes built from mRNA-seq not genome DNA Simple, quick, accurate, less cost, more complete gene sets Don Gilbert, Biology Dept., Indiana University, Bloomington, IN 47405, [email protected] 2013 June For the last 2 decades, complete gene sets have been predicted from gene signal statistics in genomic DNA. The advent of high quality, high volume transcript sequencing provides data suited to constructing genes without statistical guesses, from biological gene products. Informatics methods now have caught up to this data, to construct biologically accurate, measurably complete organism gene sets, or transcriptomes. Recently improved mRNA assembly methods of the EvidentialGene project (http://arthropods.eugenes.org/EvidentialGene/) are show here with Crustacean, Insect and Tick examples. These methods are relatively simple, rapid and biologically valid; simpler, quicker and better than genome-based predictions. While not yet in general practice, these are recommended as they yeild large improvements to published mRNA assemblies or genome- based predictions. RNA assembly combined with genome-based modelling gives more complete answers, but gene-centric projects will benefit by allocating more effort to transcript sequencing. Why mRNA-assembly is better than genome-gene prediction - Daphnia magna genome-assembly is missing 1/2 expected size to gaps, duplicates. Dmag mRNA genes are now are most complete of crustacea. - Ixodes tick genome genes very fragmented with transposons, Zebra tick mRNA genes much more complete - Pine beetle genome-gene predictions (Maker) well below Pogonus beetle mRNA-genes in homology to reference genes, despite use of ref genes for predictions. * Genome assembly gaps, mis-assemblies are common, and disrupt gene models. * Transposons are a major problem for gene modelling, and abundant. * Long introns, common w/ transposons, fragment genome-gene models * Genome gene predictor training on valid species genes is essential and difficult, no training for mRNA assembly. * Genome gene prediction remains an educated guess. mRNA-assembly is now engineering with reliably good results given accurate data and methods. * Reference protein mapping needed for genome-genes, but introduces errors. mRNA-assembly without ref proteins can/does produce stronger homology genes with no confounding of evidence. * mRNA-assembly is simple, quick, accurate, less cost, more complete relative to genome-gene prediction. EvidentialGene tr2aacds.pl is my new, somewhat easy to use pipeline for processing large piles of transcript assemblies into a biologically useful "best" set of mRNA, classified into primary and alternate transcripts. Classification is based primarily on CDS-dna local alignment identity. Transcripts at one locus share exon-sized or larger identities. Perfect fragment CDS are dropped, those with some CDS base differences are kept, with longest CDS as primary transcript. UTR identity is ignored (for now) because many of the mis-assemblies are from joined/mangled genes in UTR region. Algorithm of tr2aacds: 0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS. 1. perfect redundant removal with fastanrdb 2. perfect fragment removal with cd-hit-est 3. blastn, basic local align high-identity subsequences for alternate tr. 4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics. 5. output sequence sets from classifier: okay-main, okay-alts, drops. See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html Other Evigene scripts for mRNA assembly evigene/scripts/rnaseq/trformat.pl : regularize and unique IDs in transcript.fasta, adding prefixes for parameter sets. evigene/prot/namegenes.pl : add gene function names from UniProt and Conserved Domains (CDD) with delta-blastp. evigene/scripts/rnaseq/asmrna_trimvec.pl : process NCBI vector screen, and trim end gaps in transcripts. evigene/scripts/evgmrna2tsa.pl : check mRNA, add annotation, create public IDs and sequence files, write Genbank TSA format for public submission. Don’t: Use one assembly program, default options. Do: Use several assembly programs and options, as each will provide some better transcripts/genes. Don’t: Make many assemblies then pick one as a best gene set. Do: make millions of transcript assemblies using multiple kmers, programs and other options. Use all assemblies to pick best mRNA per locus. No single assembly is better for all loci than others, because of the large variation in expression levels, gene sizes, types of read errors, etc., that require different options and methods. Use multiple kmer sizes, from read-size down to 25. Do: Use good RNA de-novo assemblers, even if genome assembly is available. Velvet/Oases, SoapDenovo-Trans, and Trinity produce good assemblies, in that order in my work, and each produces some best assemblies the others miss. Don’t: Use Cufflinks only for genome-mapped RNA. Cufflinks underperforms in assembly versus de-novo assemblers, and has more errors of commision (joins) and omission (missing). Don’t: Select longest transcript as best mRNA, as this selects for errors in assembly. Many methods do this implicitly. Do: Select best mRNA with coding-sequence metrics (longest ORF, complete if possible). Longest proteins correlate strongly with strongest homology to other species. Do: Use at least 200 Million reads of 100bp or longer, mate-paired, to get a complete transcriptome, from current Illimina sequencers. Don’t: Use longer 454 reads, due to high error rate. I've not tested very long reads of new machines, but their errors may cause similar problems. Don’t: expect your species/data set to assemble in same way as others have reported. Don't rely on older software without testing newer, and don't expect newer versions to be better (but often they are). Do: use EvidentialGene scripts and methods for mRNA transcript assembly. The current EvidentialGene_trassembly_pipe is useable by others. http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html RNA assembly can outstrip computing resources. Large memory and cpu clusters are available as shared resources (NSF-XSEDE, others). Digital normalization and genome-mapped partitioning to assemble very large data sets in parts can help. New or update Gene-omes from EvidentialGene arthropods/ locust/locust1eg6/ Locusta migratoria; pogonusbeetle/pogonus1eg6/ Pogonus chalceus; whitefly/whitefly1eg6/ Bemisia tabaci; whiteshrimp/litova1eg6/ Litopenaeus vannamei; tigershrimp/shrimpt1eg6/ Penaeus monodon; zebratick/ztick4eg6/ Rhipicephalus pulchellus; plants/banana/banana1eg6/ Musa acuminata; cacao/genes/ Theobroma cacao; vertebrates/catfish/catfish1eg6/ Ictalurus punctatus;

Transcript of Gene-omes built from mRNA-seq not genome...

Page 1: Gene-omes built from mRNA-seq not genome DNAarthropods.eugenes.org/.../about/EvigeneRNA2013poster.pdf · 2013. 6. 12. · * Genome assembly gaps, mis-assemblies are common, and disrupt

Fig.1 Gene-ome Completeness of Animals & Plants

0 200 400 600 800 1000

020

040

060

080

010

00

Version Improvement for Gene Size x HomologyEvigene vs TSA Top1K mRNA Assemblies

Protein Size Improvement (ave. nAA)

Hom

olog

y Sc

ore

Impr

ovem

ent (

ave.

nAA

)

banana1bg

banana1bt

catfish

drosmel53

killifish

locust1bt

pogonus_beetle

tiger_shrimp

whitefly

white_shrimp

slope= 0.72 Rsq= 0.934 p= 1e−05

Fig.2 mRNA Assembly Improvements, Protein Size x Homology Alignment

Fig.3 Alignment of mRNA As-semblies to Reference Genes for longest 200 ref genes. (RED = this, BLUE = max)

Human12Human05Zebrafish11Zebrafish05Killifish12/evgCatfish13/evgCatfish12/tsa

Stickleback12Arabidopsis10Arabidopsis05Cacao12/evgPopulus11Banana13/evgBanana12/tsa

Drosophila11Drosophila04Drosophila02Tribolium11Pbeetle13/evgPbeetle12/tsaHoneybee11

Honeybee05Nasonia12/evgNasonia09Aphid11/evgAphid10Locust13/evgLocust12

Whitefly13/evgWhitefly12/tsaDaphniap10/evgDaphniap06Daphniam13/evgDaphniam11/evgWshrimp13/evg

Wshrimp12/tsaTshrimp13/evgTshrimp12/tsaZtick13/evgZtick13/tsaIxodes11Tetur11

Species Proteome

CD

S si

ze (1

000

long

est a

ve.)

2000

4000

6000

8000

100k

98k

98k

22k

86k40k

39k78k

16k16k

16k16k

15k

6k

69k69k

54k63k56k

17k

70k54k

50k24k

62k

18k43k

18k

37k

6k

23k

23k

67k

32k

26k

8k

26k

8k

24k

20k

14k

55k

evgD

evgR

evgD

evgR

evgR

evgD

evgD

evgR

evgR

evgD

evgR

evgD

evgR

evgR

evgR

H H Z Z K C C S A A C P B B D D D T P P H H N N A A L L W W D D D D W W T T Z Z I T

evgD = evigene gDNAevgR = evigene mRNA100k = longest CDS in kb

CrustaceaInsectsPlants TicksVertebrates

Clade-Best Evigene mRNA

Clade-Good Evigene mRNA

0 50 100 150 200

020

4060

80

RefHuDaTr_median−ticks: ixodes (55% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHuDaTr_median−ticks: tetur (64% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHuDaTr_median−ticks: ztick_evg (70% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHuDaTr_median−ticks: ztick_tsa (55% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

Ticks

Same RNA

Ixodes tick

Spider mite

Zebra tick TSA

0 50 100 150 200

020

4060

80

RefHoneybee−insects: pgbeetl_evg (73% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHoneybee−insects: pgbeetl_tsa (54% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHoneybee−insects: pinebeetle (54% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefHoneybee−insects: trbeetle (67% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

Beetles

Same RNA Pogonus TSA

Pine

Tribolium

Pogonus EvgR

0 50 100 150 200

020

4060

80

RefBeetle−crusts: daphmag5_evg (67% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefBeetle−crusts: litova1_evg (49% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefBeetle−crusts: penaeus_evg (65% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

0 50 100 150 200

020

4060

80

RefBeetle−crusts: penaeus_tsa (14% ref)

Longest Ref Genes

Prot

ein

alig

nmen

t %

Crustacea

Same RNA

Daphnia m EvgR

White shrimp EvgR

Tiger shrimp TSA

Tiger shrimp EvgRZebra tick EvgR

EvidentialGene mRNA-assembly pipeline

Gene-ome construction with mRNA-seq

How to build perfect mRNA transcript assemblies Does & Don’ts

EvidentialGene http://arthropods.eugenes.org/EvidentialGene/

Genome collaborators and data providers: Daphnia Genome Consortium, Generic Model Organism Database, Indiana Univ. Center for Genomics & Bioinformatics, International Aphid Genomics Consortium, Nasonia jewel wasp Genome project, Cacao chocolate tree Genome project. Funding: NSF mostly, NIH, Mars. Computers: TeraGrid/XSEDE, NCGAS

Gene-omes built from mRNA-seq not genome DNASimple, quick, accurate, less cost, more complete gene sets

Don Gilbert, Biology Dept., Indiana University, Bloomington, IN 47405, [email protected]

2013 June

For the last 2 decades, complete gene sets have been predicted from gene signal statistics in genomic DNA. The advent of high quality, high volume transcript sequencing provides data suited to constructing genes without statistical guesses, from biological gene products. Informatics methods now have caught up to this data, to construct biologically accurate, measurably complete organism gene sets, or transcriptomes. Recently improved mRNA assembly methods of the EvidentialGene project (http://arthropods.eugenes.org/EvidentialGene/) are show here with Crustacean, Insect and Tick examples. These methods are relatively simple, rapid and biologically valid; simpler, quicker and better than genome-based predictions. While not yet in general practice, these are recommended as they yeild large improvements to published mRNA assemblies or genome-based predictions. RNA assembly combined with genome-based modelling gives more complete answers, but gene-centric projects will benefit by allocating more effort to transcript sequencing.

Why mRNA-assembly is better than genome-gene prediction

- Daphnia magna genome-assembly is missing 1/2 expected size to gaps, duplicates. Dmag mRNA genes are now are most complete of crustacea. - Ixodes tick genome genes very fragmented with transposons, Zebra tick mRNA genes much more complete - Pine beetle genome-gene predictions (Maker) well below Pogonus beetle mRNA-genes in homology to reference genes, despite use of ref genes for predictions.

* Genome assembly gaps, mis-assemblies are common, and disrupt gene models. * Transposons are a major problem for gene modelling, and abundant. * Long introns, common w/ transposons, fragment genome-gene models * Genome gene predictor training on valid species genes is essential and difficult, no training for mRNA assembly. * Genome gene prediction remains an educated guess. mRNA-assembly is now engineering with reliably good results given accurate data and methods. * Reference protein mapping needed for genome-genes, but introduces errors. mRNA-assembly without ref proteins can/does produce stronger homology genes with no confounding of evidence. * mRNA-assembly is simple, quick, accurate, less cost, more complete relative to genome-gene prediction.

EvidentialGene tr2aacds.pl is my new, somewhat easy to use pipeline for processing large piles of transcript assemblies into a biologically useful "best" set of mRNA, classified into primary and alternate transcripts. Classification is based primarily on CDS-dna local alignment identity. Transcripts at one locus share exon-sized or larger identities. Perfect fragment CDS are dropped, those with some CDS base differences are kept, with longest CDS as primary transcript. UTR identity is ignored (for now) because many of the mis-assemblies are from joined/mangled genes in UTR region. Algorithm of tr2aacds: 0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS. 1. perfect redundant removal with fastanrdb 2. perfect fragment removal with cd-hit-est 3. blastn, basic local align high-identity subsequences for alternate tr. 4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics. 5. output sequence sets from classifier: okay-main, okay-alts, drops. See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html Other Evigene scripts for mRNA assembly evigene/scripts/rnaseq/trformat.pl : regularize and unique IDs in transcript.fasta, adding prefixes for parameter sets. evigene/prot/namegenes.pl : add gene function names from UniProt and Conserved Domains (CDD) with delta-blastp. evigene/scripts/rnaseq/asmrna_trimvec.pl : process NCBI vector screen, and trim end gaps in transcripts. evigene/scripts/evgmrna2tsa.pl : check mRNA, add annotation, create public IDs and sequence files, write Genbank TSA format for public submission.

Don’t: Use one assembly program, default options. Do: Use several assembly programs and options, as each will provide some better transcripts/genes. Don’t: Make many assemblies then pick one as a best gene set. Do: make millions of transcript assemblies using multiple kmers, programs and other options. Use all assemblies to pick best mRNA per locus. No single assembly is better for all loci than others, because of the large variation in expression levels, gene sizes, types of read errors, etc., that require different options and methods. Use multiple kmer sizes, from read-size down to 25. Do: Use good RNA de-novo assemblers, even if genome assembly is available. Velvet/Oases, SoapDenovo-Trans, and Trinity produce good assemblies, in that order in my work, and each produces some best assemblies the others miss. Don’t: Use Cufflinks only for genome-mapped RNA. Cufflinks underperforms in assembly versus de-novo assemblers, and has more errors of commision (joins) and omission (missing). Don’t: Select longest transcript as best mRNA, as this selects for errors in assembly. Many methods do this implicitly. Do: Select best mRNA with coding-sequence metrics (longest ORF, complete if possible). Longest proteins correlate strongly with strongest homology to other species. Do: Use at least 200 Million reads of 100bp or longer, mate-paired, to get a complete transcriptome, from current Illimina sequencers. Don’t: Use longer 454 reads, due to high error rate. I've not tested very long reads of new machines, but their errors may cause similar problems. Don’t: expect your species/data set to assemble in same way as others have reported. Don't rely on older software without testing newer, and don't expect newer versions to be better (but often they are). Do: use EvidentialGene scripts and methods for mRNA transcript assembly. The current EvidentialGene_trassembly_pipe is useable by others. http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html RNA assembly can outstrip computing resources. Large memory and cpu clusters are available as shared resources (NSF-XSEDE, others). Digital normalization and genome-mapped partitioning to assemble very large data sets in parts can help.

New or update Gene-omes from EvidentialGenearthropods/ locust/locust1eg6/ Locusta migratoria; pogonusbeetle/pogonus1eg6/ Pogonus chalceus; whitefly/whitefly1eg6/ Bemisia tabaci; whiteshrimp/litova1eg6/ Litopenaeus vannamei; tigershrimp/shrimpt1eg6/ Penaeus monodon; zebratick/ztick4eg6/ Rhipicephalus pulchellus; plants/banana/banana1eg6/ Musa acuminata; cacao/genes/ Theobroma cacao; vertebrates/catfish/catfish1eg6/ Ictalurus punctatus;