Bits of the Green Junk

Barbados Workshop on the Computational Identification and Analysis of Transposable Elements April 18th - 25th, 2014 Florian Maumus with Hadi Quesneville (URGI-INRA, Versailles, France)

description

By Florian Maumus and Hadi Quesneville. We present our opinions, recent developments and perspectives regarding whole-genome repeatome annotation. This talk was presented by Florian Maumus at the Barbados Workshop on the Computational Identification and Analysis of Transposable Elements, Holetown, Barbados, April 18-24, 2014.

Transcript of Bits of the Green Junk

Page 1: Bits of the Green Junk

Barbados Workshop on the Computational Identification and Analysis of Transposable Elements

April 18th - 25th, 2014

Florian Maumus with Hadi Quesneville (URGI-INRA, Versailles, France)

Page 2: Bits of the Green Junk
Page 3: Bits of the Green Junk

REPET package

TEdenovo, TEannot

Genome repeat annotation

Hadi Quesneville

Page 4: Bits of the Green Junk

De novo repeatome detection

Deep repeatome annotation

Repeat annotation in large genomes

Page 5: Bits of the Green Junk

De novo repeatome detection

Deep repeatome annotation

Repeat annotation in large genomes

Page 6: Bits of the Green Junk
Page 7: Bits of the Green Junk

Repeat complement = Repeatome

The Repeatome includes:
• Transposable elements
• Endogenous viruses
• Tandem repeats
• Ribozymes
• Genes

= What you get with repeat-finders!

Page 8: Bits of the Green Junk

Burst and Decay

Page 9: Bits of the Green Junk

Dark matter, the genomic humus

Burst         Decay          Melt
« Repeats »   Old repeats    Dark matter
Detected      Detectable?    Background noise

Page 10: Bits of the Green Junk

Turnover ++, recent activity +++

Turnover -, recent activity -

Complexity of the repeatome

old

young

Page 11: Bits of the Green Junk

Maize: 2.3 Gb genome, about 85% repeats

H: 3.2 Gb genome, about 50% repeats

Different history, different challenges

Page 12: Bits of the Green Junk

LECA: core eukaryotic genes + Copia, Gypsy, LINEs, DNA transposons…

TEs have been jumping around genes over evolutionary times

Page 13: Bits of the Green Junk
Page 14: Bits of the Green Junk

Contents include:
• Professional Tool Roll
• Archaeology Margin Trowel
• Battiferro Leaf & Square
• Battiferro forged ornamental tools lance
• Battiferro Trowel and Square
• Aluminium scale rulers
• Small Tools Set
• Hand Shovel
• Small Brush
• Mason Line*
• Line Pegs
• Line Level
• Plumb Bob
• Retractable Hi-Viz Grip Knife
• Battiferro Trowel*

*Optional.

Archeology toolbox

Page 15: Bits of the Green Junk
Page 16: Bits of the Green Junk

Repeatome toolbox

K-mer strict: Tallymer, DSK

K-mer based: RepeatScout, P-clouds

Similarity: e.g. Recon

Combined: RepeatModeler (RepeatScout + Recon)

TEdenovo (Recon + Piler + Grouper; + RepeatScout in v2.2)
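
To make the "k-mer strict" idea concrete, here is a minimal sketch of exact k-mer counting on a toy sequence. It only illustrates the principle (count every overlapping word of length k and keep those seen more than once). It is not how Tallymer or DSK are implemented, and the function names and the `min_count` parameter are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA string (case-insensitive)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_kmers(sequence, k, min_count=2):
    """Return the k-mers that occur at least min_count times."""
    counts = count_kmers(sequence, k)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    toy = "ACGTACGTTTACGT"  # toy sequence, not real data
    print(repeated_kmers(toy, 4))
```

Real genomes need disk-based or index-based counters, which is exactly what dedicated tools provide. The sketch is only meant to show what "strict" (exact word match) means.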

Page 17: Bits of the Green Junk

REPET: TEdenovo

TEdenovo pipeline

+ RepeatScout (v2.2)

Consensus library

REPET Classification utility

Page 18: Bits of the Green Junk

REPET Classification tool

Consensus library

TR search: Tandem Repeat Finder

BLASTx, tBLASTx: Repbase

Pfam HMM, GyDB HMM

rDNA, tRNA

Host genes

Summary of evidences and proposed classification:

Consensus    Structure  TR     Bx       Btx   Profiles  Proposed classification
Consensus 1  termLTRs   0.12%  AtGypsy  none  IN, RT    LTR retro
Consensus 2  none       0.32%  none     none  LRR       Host gene
Consensus 3  none       0.23%  none     none  none      Unclassified

Page 19: Bits of the Green Junk

TEannot pipeline genome annotation

REPET: TEannot

Page 20: Bits of the Green Junk

TEdenovo

RepeatScout

RepeatModeler

Performance, Complementarity ?

Page 21: Bits of the Green Junk

Experimental model

Arabidopsis thaliana, 120 Mb

Page 22: Bits of the Green Junk

Consensus sequences

Page 23: Bits of the Green Junk

Sensitivity & Specificity

[Figure: genome coverage, axis from 0 to 50 Mb]

Page 24: Bits of the Green Junk

Sensitivity

[Figure: percent reference coverage (axis 0 to 100) for Tallymer, TRF, TEdenovo, RepeatModeler, RepeatScout, TEdenovo+RS+RM, and All]
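
A measure such as "percent reference coverage" can be sketched as interval arithmetic: the share of reference-annotated bases that a predicted annotation also covers. The talk does not state how the figure was computed, so the half-open (start, end) coordinates and the function names below are assumptions made for illustration only.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bp(a, b):
    """Number of bases shared by two interval sets (each merged first)."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def percent_reference_coverage(reference, predicted):
    """Percent of reference bases that are also covered by the prediction."""
    ref_bp = sum(end - start for start, end in merge_intervals(reference))
    if ref_bp == 0:
        return 0.0
    return 100.0 * overlap_bp(reference, predicted) / ref_bp


if __name__ == "__main__":
    reference = [(0, 100), (200, 300)]               # toy coordinates
    predicted = [(50, 150), (250, 260), (255, 270)]  # toy coordinates
    print(percent_reference_coverage(reference, predicted))
```

Swapping the reference annotation for 24-nt sRNA loci would give the analogous "biological sensitivity" measure under the same assumptions.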

Page 25: Bits of the Green Junk

Biological Sensitivity

[Figure: percent 24-nt sRNA coverage (Lister et al., 2008), axis 0 to 100, for TEdenovo, RepeatModeler, RepeatScout, TEdenovo+RS+RM, and All]

Page 26: Bits of the Green Junk

[Figure: genome coverage increase (%), axis 0 to 35, for TEdenovo, RepeatModeler, and RepeatScout]

REPET, RepeatScout, and RepeatModeler employ complementary computational methods that together give a better representation of repeatome complexity.
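
Complementarity can be quantified as the coverage gained when a second annotation is added to a first one. The sketch below computes that gain as a percentage of the first annotation's coverage. Whether the slide's "genome coverage increase (%)" is defined this way is an assumption, as are the function names and the toy coordinates.

```python
def covered_bp(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start          # new, disjoint block
            current_end = end
        elif end > current_end:
            total += end - current_end    # extends the current block
            current_end = end
    return total


def coverage_increase_percent(base, extra):
    """Relative gain in covered bases when `extra` is merged into `base`."""
    base_bp = covered_bp(base)
    if base_bp == 0:
        raise ValueError("base annotation covers no bases")
    combined_bp = covered_bp(list(base) + list(extra))
    return 100.0 * (combined_bp - base_bp) / base_bp


if __name__ == "__main__":
    tool_a = [(0, 100)]                 # toy annotation from one tool
    tool_b = [(50, 120), (200, 230)]    # toy annotation from another tool
    print(coverage_increase_percent(tool_a, tool_b))
```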

Page 27: Bits of the Green Junk

Conclusions I

TEdenovo outcompetes RepeatModeler and RepeatScout

Greater coverage with:
• Fewer consensus sequences
• Larger consensus sequences
• Larger copies

Complementarity of TEdenovo, RepeatModeler and RepeatScout

Comprehensive annotation of complex repeatomes

Page 28: Bits of the Green Junk

De novo repeatome detection

Deep repeatome annotation

Repeat annotation in large genomes

Page 29: Bits of the Green Junk

Arabidopsis, 120 Mb

[Figure: genome composition bar from 0% to 100%, divided into CDS, Repeatome, and Dark matter]

Experimental model

Three strategies with REPET:
• Annotate genome with genomic copies
• Use relaxed parameters for HSP detection
• Use P-clouds to detect short repeat fragments

Page 30: Bits of the Green Junk

Iterative annotation

Annotate genome with genomic copies

(Expand the knowledge)

Page 31: Bits of the Green Junk

Iterative annotation

Annotate genome with genomic copies

(Expand the knowledge)

Page 32: Bits of the Green Junk

Iterative annotation

Annotate genome with genomic copies

(Expand the knowledge)

Page 33: Bits of the Green Junk

Iterative annotation

Annotate genome with genomic copies

[Diagram labels: Genome, TEdenovo, Consensus, then four successive rounds of TEannot, each producing Genomic copies]

Page 34: Bits of the Green Junk

Iterative annotation

Annotate genome with genomic copies

[Figure: coverage, axis 0 to 100. Categories: RepeatModeler, RepeatScout, Tallymer, 24-nt sRNA, Reference, Genome. Series: TEdenovo_1, TEdenovo_2, TEdenovo_3, TEdenovo_4]

Page 35: Bits of the Green Junk

Dinucleotide composition

[Figure: values (axis -0.05 to 0.15) for the 16 dinucleotides AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT. Series: CDS, TEdenovo, delta_2vs1, delta_3vs2, delta_4vs3]
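
Dinucleotide composition can be computed by counting overlapping pairs of bases and normalising by the total. The sketch below is a minimal illustration on toy strings. The talk does not describe how its values were derived, so the function name and the choice to ignore pairs containing non-ACGT characters are assumptions.

```python
from collections import Counter
from itertools import product

# The 16 dinucleotides in alphabetical order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(sequence):
    """Relative frequency of each overlapping dinucleotide in a DNA string.

    Pairs containing characters other than A, C, G, T (for example N) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    return {d: (counts[d] / total if total else 0.0) for d in DINUCLEOTIDES}


if __name__ == "__main__":
    # Toy sequences standing in for two annotation sets (not real data).
    for name, seq in {"set_a": "ATGGCGCGTTAA", "set_b": "ATATATTTAAAT"}.items():
        freqs = dinucleotide_frequencies(seq)
        print(name, {d: round(f, 2) for d, f in freqs.items() if f > 0})
```

Comparing such profiles between sequence sets (for instance coding sequences versus repeat copies) is one way to check whether newly annotated regions resemble known repeats in composition.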

Page 36: Bits of the Green Junk

Relevance

Genome annotation using the delta_2vs1 copies:
• Masks as much as 23 Mb (19.5%) of the genome
• Covers 66% of the reference annotation and 56% of the TEdenovo annotation

The supplementary annotations from TEdenovo_2 are highly representative of the A. thaliana repeatome.

Page 37: Bits of the Green Junk

Relaxed (parameters) annotation

Page 38: Bits of the Green Junk

Relaxed (parameters) annotation

Page 39: Bits of the Green Junk

Relaxed (parameters) annotation

Default: Identity > 90%, E-value < 1e-300

Cool: Identity > 85%, E-value < 1e-50

Soft: Identity > 80%, E-value < 1e-20
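
The three settings amount to progressively looser filters on high-scoring pairs. A minimal sketch of such a filter is given below. The thresholds are the ones listed on the slide, but the `Hsp` record, its field names and the `passes` function are hypothetical and do not reflect REPET's file formats or configuration options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """Hypothetical record for one high-scoring pair."""
    identity: float  # percent identity, 0 to 100
    evalue: float    # alignment E-value


# (minimum identity in percent, maximum E-value), as listed on the slide.
THRESHOLDS = {
    "default": (90.0, 1e-300),
    "cool": (85.0, 1e-50),
    "soft": (80.0, 1e-20),
}


def passes(hsp, setting):
    """True if the HSP satisfies both thresholds of the given setting."""
    min_identity, max_evalue = THRESHOLDS[setting]
    return hsp.identity > min_identity and hsp.evalue < max_evalue


if __name__ == "__main__":
    hsps = [Hsp(95.0, 0.0), Hsp(87.0, 1e-60), Hsp(82.0, 1e-25)]  # toy values
    for setting in THRESHOLDS:
        kept = [h for h in hsps if passes(h, setting)]
        print(setting, len(kept))
```

Each looser setting keeps every HSP the stricter one keeps, so relaxing the parameters can only add alignments, at the cost of more HSPs to cluster.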

Consensus size

Page 40: Bits of the Green Junk

Relaxed (parameters) annotation

[Figure: coverage, axis 0 to 100. Categories: RepeatModeler, RepeatScout, Reference, Tallymer, 24-nt sRNA. Series: TEdenovo_1, TEdenovo_cool, TEdenovo_soft, TEdenovo_soft_2]

Page 41: Bits of the Green Junk

[Figure: copy/consensus identity along chr1 for TEdenovo, Cool, and Soft]

Page 42: Bits of the Green Junk

Deep annotation of the A. thaliana repeatome

[Diagram: libraries from RepeatScout, RepeatModeler, TEdenovo, and Repbase (+ Buisine et al.); remove redundancy; bundle library; TEannot. Plot label: consensus size]

Page 43: Bits of the Green Junk

Deep annotation of the A. thaliana repeatome

[Diagram labels: selected, not selected, TEannot, P-clouds, complete bundle annotation]

Page 44: Bits of the Green Junk

P-clouds

Copies, Consensus

In-cloud k-mers

De Koning et al.

Page 45: Bits of the Green Junk
Page 46: Bits of the Green Junk

• TEdenovo

Page 47: Bits of the Green Junk

• Bundle

Page 48: Bits of the Green Junk

=> Repeated and repeat-derived sequences contribute at least 30% to the A. thaliana genome

Enhanced repeat detection in gene-rich regions

• Bundle + P-clouds

Page 49: Bits of the Green Junk

Arabidopsis repeats browser

Deep annotations

REPET

RepeatModeler

RepeatScout

Buisine et al.

24-nt sRNA

Genes

Page 50: Bits of the Green Junk

Conclusions II

Innovative approaches for deep repeatome annotation

• About one third of the A. thaliana genome is of repetitive origin (vs 24%)
• Increased sensitivity and detection of old repeat remnants
• Improved genome evolution and epigenetic analyses

Continuum between repeatome and genomic dark matter

Time

Page 51: Bits of the Green Junk

De novo repeatome detection

Deep repeatome annotation

Repeat annotation in large genomes

Page 52: Bits of the Green Junk

All genomes should benefit from the greater quality of TEdenovo

Adapted from Nina V. Fedoroff (2012) and Steven M. Carr

Page 53: Bits of the Green Junk

Limitations with REPET

All-by-all genome comparison => LOTS (Gb) of high-scoring segment pairs (HSPs)

HSP files > 1 Gb are not handled by Piler

Grouper can run for weeks

Until recently, it was impossible to run TEdenovo on whole large and/or highly repetitive genomes

Page 54: Bits of the Green Junk

Solutions

Use a sample of the whole genome as input for TEdenovo (e.g. 300 Mb)
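
One simple way to build such a sample is to pick whole sequences (contigs or scaffolds) at random until the target size is reached. The talk does not say how the sample was drawn, so the strategy, the function name and the `seed` parameter below are assumptions, shown only as a sketch.

```python
import random


def sample_sequences(lengths, target_bp, seed=0):
    """Randomly pick sequence names until their summed length reaches target_bp.

    lengths: dict mapping sequence name to length in bp.
    Returns (chosen_names, total_bp). If the whole genome is smaller than
    target_bp, every sequence is returned.
    """
    rng = random.Random(seed)   # fixed seed so the sample is reproducible
    names = sorted(lengths)
    rng.shuffle(names)
    chosen, total = [], 0
    for name in names:
        if total >= target_bp:
            break
        chosen.append(name)
        total += lengths[name]
    return chosen, total


if __name__ == "__main__":
    # Toy scaffold lengths in bp (not real data); target of 300 Mb.
    toy = {f"scaffold_{i}": 40_000_000 + i * 1_000_000 for i in range(20)}
    names, total = sample_sequences(toy, 300_000_000)
    print(len(names), total)
```

Sampling whole sequences keeps repeat copies intact. A fixed seed makes the subset reproducible between runs.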

(As recommended for RepeatModeler)

Page 55: Bits of the Green Junk

Tomato genomes

S. pennellii : 942 Mb

S. lycopersicum : 782 Mb

Page 56: Bits of the Green Junk

[Diagram labels: 320 Mb input, TEdenovo (n HSP >= 5), consensus library, TEannot. Scale: 0, 0.5, 1 Gb]

Page 57: Bits of the Green Junk

82% of the Solanum pennellii ATGC space masked

Page 58: Bits of the Green Junk

Conclusions III

Efficient annotation of large plant genomes with REPET

Still quite a long process!

Page 59: Bits of the Green Junk

De novo repeat annotation in large genomes

Future developments

Parallelize Grouper

Parallelize the “Long join” procedure

Establish phyla-specific approaches

Develop strategies to annotate genomes with different compositions (e.g. old, complex repeatomes as compared to large plant genomes)

Page 60: Bits of the Green Junk

De novo repeat annotation in large genomes

Future challenges & perspectives

Propose TEdenovo and TEannot pipelines on GALAXY

Deliver REPET compilation for use on a cloud

Page 61: Bits of the Green Junk

Véronique Jamilloux

Tina Alaeitabar

Timothée Chaumier

Mark Moissette

Olivier Inizan

Hadi Quesneville

THANK YOU !