Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0). Prashant S Hosmani, Surya Saha, Mirella Flores, Stephane Rombauts, Florian Maumus, Henri van de Geest, Gabino Sanchez-Perez and Lukas Mueller. Boyce Thompson Institute, Ithaca, NY; VIB Department of Plant Systems Biology, Ghent University, Gent, Belgium; URGI, INRA, Université Paris-Saclay, Versailles, France; Wageningen Plant Research, Wageningen University, Netherlands. [email protected]


Page 1: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Prashant S Hosmani, Surya Saha, Mirella Flores, Stephane Rombauts, Florian Maumus, Henri van de Geest, Gabino Sanchez-Perez and Lukas Mueller

Boyce Thompson Institute, Ithaca, NY

VIB Department of Plant Systems Biology, Ghent University, Gent, Belgium

URGI, INRA, Université Paris-Saclay, Versailles, France

Wageningen Plant Research, Wageningen University, Netherlands

[email protected]

Page 2: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Acknowledgements

Gabino Sanchez

Henri van de Geest

SGN Community (You!)

RNAseq data contributors

Stephane Rombauts

Florian Maumus

Page 3: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

SL3.0

Solanum lycopersicum

Heinz 1706

Page 4: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

BAC Integration Workflow

Automatic integration of BACs

Manual validation

NCBI validation

https://github.com/solgenomics/Bio-GenomeUpdate

BAC assemblies

Align to SL2.50

• 500 bp BAC ends

• 100% identity

Place BACs

1,069 full-length phase htgs3 BACs integrated and ~11 Mb of contig gaps removed
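The placement rule on this slide (both 500 bp BAC ends must match SL2.50 at 100% identity) can be illustrated with a minimal Python sketch. It is not the Bio-GenomeUpdate code: exact string search stands in for a real aligner, only the forward orientation is handled, and the function names `place_bac` and `integrate_bac` are hypothetical.

```python
def place_bac(reference: str, bac: str, end_len: int = 500):
    """Return the (start, end) reference span bounded by the two BAC ends, or None.

    Assumption: an exact substring match models "100% identity"; a real
    workflow would use a sequence aligner and also check the reverse strand.
    """
    if len(bac) < 2 * end_len:
        return None
    left, right = bac[:end_len], bac[-end_len:]
    start = reference.find(left)
    if start == -1:
        return None
    # The right end must be found downstream of the left end.
    right_pos = reference.find(right, start + end_len)
    if right_pos == -1:
        return None
    return start, right_pos + end_len


def integrate_bac(reference: str, bac: str, end_len: int = 500) -> str:
    """Replace the spanned reference region (gaps included) with the BAC sequence."""
    span = place_bac(reference, bac, end_len)
    if span is None:
        return reference  # leave the reference untouched if the BAC cannot be placed
    start, end = span
    return reference[:start] + bac + reference[end:]


# Toy example with 4 bp "ends" instead of 500 bp.
toy_reference = "AAAACCGT" + "NNNNNN" + "TTGACAAA"
toy_bac = "CCGT" + "GGATCC" + "TTGA"
updated = integrate_bac(toy_reference, toy_bac, end_len=4)
```

In the toy example the run of `N` characters between the two anchored ends is replaced by the BAC sequence, which is the sense in which integrated BACs remove contig gaps.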

Page 5: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

BioNano Workflow

Assemble molecules into CMaps

Hybrid assembly with NGS scaffolds

Manual validation

Hybrid assembly statistics

Scaffolds: 57

Total Genome Map Length: 779.789 Mb

Avg. Genome Map Length: 13.681 Mb

Genome Map N50: 25.384 Mb
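For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the usual definition: the length of the map at which the running total, taken from longest to shortest, first reaches half of the total length. The per-scaffold lengths are not given on the slide, so the list below is a made-up toy input; only the total and the scaffold count come from the slide.

```python
def n50(lengths):
    """Length L such that maps of length >= L hold at least half the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Hypothetical toy lengths (not the real hybrid scaffolds).
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)

# The average on the slide follows from the total length and scaffold count.
avg_map_length_mb = round(779.789 / 57, 3)
```

An N50 (25.384 Mb) well above the average (13.681 Mb) indicates that most of the total length sits in a relatively small number of long maps.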

Page 6: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Chr00 Integration

Chr00

Chr02

Cmap 84

• Chr00 contig NW_004194391.1 (203,142 bp) inserted into a 150 kb scaffold gap on chr09

• Two inversions on chromosome 12

• 19 gaps resized

Chr00 contig NW_004194387.1 (561,203bp) integrated in 1.4Mb scaffold gap

Page 7: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

ITAG3.0

Annotation

Page 8: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Structural annotation pipeline

Repeat masking of the genome

Evidence – RNA and protein

ITAG 2.40 gene models

Post-processing

• Genes with functional domain support

• Assign Solyc-ID to novel genes

Page 9: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Repeat identification and masking of the genome

• Generated custom repeat library (RepeatModeler)

• Excluded repeats with similarity to known proteins from SwissProt (ProtExcluder)

• Masked 56.39% of the genome (RepeatMasker)

Page 10: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Repeat identification and classification

Extensive identification and classification of repeats using REPET, which masks 61% of the SL3.0 reference genome.

Florian Maumus

Page 11: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

ITAG 2.40 processing

• ITAG2.40 protein-coding genes: 34,725

• Webapollo curated genes

• Removed contamination (56)

• Removed transposons (2,244): 32,425 genes remaining

• ITAG2.40 mapped with GMAP

• Mapped to the SL3.0 repeat-masked genome: 31,309

Page 12: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Expression evidence for annotation

Expression data evidence

• 8 billion RNAseq reads

• Tissue and treatment specific RNAseq

• 5’ and 3’ UTR enriched RNAseq

• RENseq for NBS-LRR genes

• PacBio Iso-Seq data

• SwissProt plant proteins

Reads were mapped onto SL3.0 and the transcriptome was assembled

Mapping rate ~85%

Page 13: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

RNAseq data sources

• Jim Giovannoni (BTI/USDA)

• Jocelyn Rose (Cornell)

• Greg Martin (BTI)

• Zhangjun Fei (BTI/USDA)

• Jonathan Jones (The Sainsbury Laboratory)

• Asaph Aharoni (Weizmann Institute of Science)

• Neelima Sinha (University of California, Davis)

Page 14: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

MAKER pipeline

Ab initio gene prediction methods

• Augustus (Training using BRAKER1)

• SNAP (MAKER based training)

• GeneMark (with high quality genes)

• Eugene (Stephane Rombauts)

Updating legacy annotation (ITAG2.40)

Post-processing

Added only genes with functional domain support (Pfam): ~800 genes.

Removed genes with 70% overlap with repeats (674 genes).

Assigned Solyc IDs to novel genes following the ITAG convention.

Novel genes are assigned Solyc IDs that fall between existing Solyc IDs.
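The repeat-overlap filter in the post-processing list can be sketched in Python as follows. The slide does not say how overlap was measured, so this sketch assumes it is the fraction of the gene span covered by repeat intervals and that genes at or above 70% are removed; the function names and coordinates are hypothetical.

```python
def repeat_overlap_fraction(gene, repeats):
    """Fraction of a gene interval covered by repeat intervals.

    Intervals are (start, end) tuples, 0-based and end-exclusive.
    Overlapping repeats are merged so that no base is counted twice.
    """
    g_start, g_end = gene
    # Keep only repeats that touch the gene, clipped to the gene boundaries.
    clipped = sorted(
        (max(start, g_start), min(end, g_end))
        for start, end in repeats
        if start < g_end and end > g_start
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (g_end - g_start)


def keep_gene(gene, repeats, max_fraction=0.70):
    """Assumed rule: drop a gene when repeats cover 70% or more of its span."""
    return repeat_overlap_fraction(gene, repeats) < max_fraction


# Toy coordinates: two overlapping repeats cover 80 of 100 bases of the gene.
toy_gene = (100, 200)
toy_repeats = [(90, 150), (140, 180), (300, 400)]
toy_fraction = repeat_overlap_fraction(toy_gene, toy_repeats)
```

Merging the repeat intervals before summing matters because repeat annotations frequently overlap one another, and double counting would inflate the fraction.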

Page 15: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Improvements in ITAG 3.0 compared with ITAG 2.40

                   ITAG 2.40   ITAG 3.0
# of genes         34,725      34,769
Avg. gene length   1,209 bp    1,529 bp
Exons per gene     4.61        5.10
5’ UTR per gene    0.39        0.63
3’ UTR per gene    0.44        0.62

Novel genes in ITAG3.0: 5,822

Page 16: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Gene structure improvement example

[Figure: two panels comparing ITAG3.0 and ITAG2.40 gene models against RNAseq XY plots. Panel titles: Correct fusion example; UTR example.]

Page 17: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Quality check - Annotation Edit Distance (AED)

AED = 0: complete support

AED = 1: lack of support

[Figure: plot with AED on the axis]
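The slide gives only the two end points of the AED scale. As a hedged illustration, the Python sketch below uses the commonly cited definition, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by evidence. The slide itself does not state this formula, and representing annotation and evidence as plain sets of base positions is a simplifying assumption.

```python
def aed(annotation, evidence):
    """Annotation Edit Distance between two collections of base positions.

    0.0 means the annotation and its evidence agree completely;
    1.0 means they share no bases at all.
    """
    annotation, evidence = set(annotation), set(evidence)
    shared = len(annotation & evidence)
    if shared == 0:
        return 1.0
    sensitivity = shared / len(evidence)    # evidence bases recovered by the annotation
    specificity = shared / len(annotation)  # annotation bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2


# Toy intervals expressed as ranges of base positions.
full_support = aed(range(0, 100), range(0, 100))
no_support = aed(range(0, 100), range(200, 300))
half_support = aed(range(0, 100), range(50, 150))
```

The two extremes reproduce the slide: identical annotation and evidence score 0, and an annotation with no overlapping evidence scores 1.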

Page 18: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Functional annotation

Automated Assignment of Human Readable Descriptions (AHRD)

Swissprot plant protein database

TrEMBL plant protein database

Araport 11 (Arabidopsis latest annotation)

User curated locus information from solgenomics.net (2000+)

Unknown proteins

In ITAG 3.0, 409 genes have the functional description “Unknown proteins”, compared to 7,689 in ITAG2.40

Page 19: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Functional annotation

Automated Assignment of Human Readable Descriptions (AHRD)

AHRD-Version 3.3.2

Examples, with the quality score shown in parentheses:

Solyc08g081780.1.1 Dirigent protein (***)

Solyc01g008960.2.1 Argonaute family protein (***)

Solyc01g013880.1.1 Leucine-rich repeat receptor-like protein kinase family protein (*-*)

Position   Criteria
1          Bit score of the blast result is >50 and e-value is <e-10
2          Alignment of the blast result is >60%
3          Human Readable Description score is >0.5

“AHRD’s quality-code consists of a three character string, where each character is either ‘*’ if the respective criteria is met or ‘-’ otherwise.”
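The three-position quality code described above can be reproduced with a few lines of Python. This is a sketch of the rule as stated on the slide, not AHRD's own code or API; the function name, parameter names and the input values in the examples are hypothetical.

```python
def ahrd_quality_code(bit_score, e_value, alignment_pct, hrd_score):
    """Build the three-character code: '*' if a criterion is met, '-' otherwise."""
    checks = (
        bit_score > 50 and e_value < 1e-10,  # position 1
        alignment_pct > 60,                  # position 2
        hrd_score > 0.5,                     # position 3
    )
    return "".join("*" if met else "-" for met in checks)


# Made-up inputs that yield the two codes shown on the slide.
all_met = ahrd_quality_code(bit_score=120, e_value=1e-30, alignment_pct=85, hrd_score=0.9)
short_alignment = ahrd_quality_code(bit_score=120, e_value=1e-30, alignment_pct=40, hrd_score=0.9)
```

A code such as `*-*` therefore means a strong BLAST hit and a good description score, but an alignment at or below the 60% threshold.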

Page 20: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Novel genes in ITAG3.0

5,822 novel genes in ITAG 3.0

Page 21: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Future work

Genome

Improving genome assembly by sequencing with PacBio technology

Annotation

tRNA, non-coding RNA annotation

Multiple isoforms

Co-expression network based functional annotation

Page 22: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Workshop: SGN and RTB Databases, Tuesday, Jan 17, 10:30 AM

Posters

Surya Saha: Improved Tomato Genome Reference (SL3.0) using Full-Length BACs, BioNano Optical Maps and SGN Community Resources (P0798)

Prashant Hosmani: ITAG3.0 Annotation for the New Tomato Reference Genome SL3.0 (P0797)

Page 23: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Thank you!!

Questions??

Page 24: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Data available to download from FTP

• ITAG 3.0

• GFF, proteins, transcripts, CDS

• List of fused genes

SGN Workshop, SOL 2016

Page 25: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Gap Reduction

[Chart: BACs integrated (axis 0 to 500) and reduction in contig gaps (axis 0.00% to 70.00%), plotted for chromosomes 1 to 12]

Page 26: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Repeat classification


Class                      Type               Total
LTR retrotransposon        Copia              64,840,935
LTR retrotransposon        Gypsy              260,719,161
LTR retrotransposon        TRIM/LARD          671,571
Non-LTR retrotransposon    LINE               9,871,924
Putative_retrotransposon   Putative_RT        528,982
DNA                        DNA                20,712,725
Helitron                   Helitron           1,210,271
TIR                        TIR                12,144,035
Confused                   Confused           48,373,586
Unclassified               Unclassified       70,850,157
                           Endogenous virus   5,839,457
Hostgene                   Hostgene           5,044,454
Tandem repeats             Tandem repeats     8,901,715
Ns
SUM repeats                                   509,708,973

Page 27: Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)

Mapping rates for different RNAseq data

RNAseq data   Reads (millions)   REPET light   RepeatModeler light
AC_Jim        637                86.87%        88.03%
epigenome     82                 60.77%        64.35%
UTR seq       87                 85.88%        86.57%
TEA part A    4,295              84.41%        84.39%
TEA part B    2,449              84.40%        84.71%
RENseq        15                 32.91%        39.83%
Yang          331                79.94%        80.28%
Total reads   7,930
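To relate the per-dataset rates in this table to a single overall figure, one option is a read-weighted average, sketched below in Python. The numbers are copied from the table; weighting by the listed read counts is an assumption about how an overall rate would be derived, not a method stated in the slides.

```python
# (reads in millions, REPET light %, RepeatModeler light %), copied from the table.
DATASETS = {
    "AC_Jim": (637, 86.87, 88.03),
    "epigenome": (82, 60.77, 64.35),
    "UTR seq": (87, 85.88, 86.57),
    "TEA part A": (4295, 84.41, 84.39),
    "TEA part B": (2449, 84.40, 84.71),
    "RENseq": (15, 32.91, 39.83),
    "Yang": (331, 79.94, 80.28),
}


def weighted_mapping_rate(rows, column):
    """Average the mapping rate in `column` (1 or 2), weighted by read count."""
    total_reads = sum(row[0] for row in rows.values())
    mapped = sum(row[0] * row[column] for row in rows.values())
    return mapped / total_reads


repet_rate = weighted_mapping_rate(DATASETS, 1)
repeatmodeler_rate = weighted_mapping_rate(DATASETS, 2)
print(f"REPET light: {repet_rate:.2f}%  RepeatModeler light: {repeatmodeler_rate:.2f}%")
```

Because the two TEA datasets contribute most of the reads, they dominate the weighted average, while small low-rate sets such as RENseq have little effect on it.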