How did we reveal the genome's secrets?

1

In the name of Allah

How did we reveal the genome's secrets? The peanut genome as an example

2

The Past

3

19th century

4

20th century

5

20th century

6

20th century

Homo sapiens genome = 3,000,000,000 base pairs = 3,000,000,000 dollars

7

21st century

8

21st century

>3000000 base pairs

9

21st century

10

NGS

Next Generation Sequencing

2nd generation of sequencers

Speed, Cost, Sample size, Accuracy

Benefits of FGS over NGS

Third generation of sequencers

11

Genome sequencing

STEP 1

Sample preparation

STEP 2

Sequencing

STEP 3

Assembly

STEP 4

Annotation

12

Sample preparation

Solid-phase amplification

Emulsion PCR

Primer immobilized

Template immobilized

Polymerase immobilized

13

Genome sequencing

STEP 1

Sample preparation

STEP 2

Sequencing

STEP 3

Assembly

STEP 4

Annotation

14

Sequencing

Sequencing by synthesis

Sequencing by ligation

Ion Semiconductor

Real-time sequencing Pyrosequencing

15

Genome sequencing

STEP 1

Sample preparation

STEP 2

Sequencing

STEP 3

Assembly

STEP 4

Annotation

16

Assembly

Benefits of FGS over NGS

Longer reads

Sensitivity

Coverage bias

Variation

17

Assembly

Under ideal conditions, assembly is:

Simply merging reads with maximal overlap

But

Genome organization is complex

Coverage, repetitive sequences

Length, copy number, sequence
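
The ideal case on this slide, merging reads with maximal overlap, can be sketched as a greedy loop. This is only an illustration: the function names, the `min_len` cutoff and the toy reads are assumptions, not part of the talk, and real assemblers do not work this naively.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Toy, repeat-free example (hypothetical reads)
print(greedy_assemble(["AGCTTGA", "TTGACCG", "ACCGATA"]))
```

On repeat-free toy data the reads collapse into a single contig. When the same sequence occurs at several places in the genome, the largest overlap may join reads from different copies, which is why repetitive sequences and uneven coverage make real assembly hard.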

18

Assembly

99.99% accuracy in the euchromatic portion of the genome

Complete assembly vs. draft assembly

19

Assembly

Assembly algorithms

EARLY STRATEGIES OLC

DE BRUIJN STRING GRAPH

20

OLC assembly

Overlap - Layout - Consensus

21

OLC assembly

• Genome resolution increases with read length
• Benefits from the whole read length
• Conservative in nature
• Better response to self error correctors
• The human genome was constructed primarily using OLC algorithms
• Notable OLC-based algorithms: Newbler, PCAP, Arachne, Celera
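
The consensus step of OLC can be illustrated with a per-column majority vote. This is a minimal sketch that assumes the layout step has already placed every read at a known start offset. The function name and the example reads are hypothetical, and production assemblers use multiple alignment rather than a simple vote.

```python
from collections import Counter


def consensus(layout: list[tuple[int, str]]) -> str:
    """Majority-vote consensus from reads placed at known start offsets."""
    length = max(start + len(read) for start, read in layout)
    columns = [Counter() for _ in range(length)]
    for start, read in layout:
        for k, base in enumerate(read):
            columns[start + k][base] += 1
    # Most frequent base per column; uncovered columns are reported as N
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Four overlapping reads; the third carries one sequencing error (T instead of A)
layout = [(0, "ACGTAC"), (2, "GTACGG"), (3, "TTCGGA"), (4, "ACGGA")]
print(consensus(layout))
```

Because each position is covered by several reads, the single erroneous base is outvoted.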

22

De Bruijn assembly

Replace each read with its set of k-mers
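
Replacing reads with k-mers can be sketched in a few lines. This is an illustrative construction, assuming error-free reads and a graph whose nodes are (k-1)-mers; the function names are hypothetical and real assemblers add error removal and graph simplification on top.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """All substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


print(kmers("ATGGC", 3))
print(de_bruijn(["ATGGC", "TGGCA"], 3))
```

The two overlapping reads share k-mers, so they collapse onto the same path in the graph without any pairwise overlap computation.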

23

De Bruijn assembly

• Requires highly accurate reads
• Discards some of the ability of reads to resolve repeats longer than the k-mers
• Doesn't require the storage of pairwise overlaps
• Very useful in mammalian MPS projects
• Aggressive in nature
• Error correctors affect the results
• Notable de Bruijn-based algorithms: ALLPATHS, SOAPdenovo, ABySS, Velvet

24

String graph assembly

Related to the A-Bruijn graph

25

String graph assembly

• Doesn't decompose reads into k-mers
• Takes the full length of reads
• Benefits from both the graph structure and the read length
• FALCON is an open-source implementation of the string graph (SG)

26

Some algorithms

• Celera: revised to CABOG; reduces homopolymer runs; builds unitigs from maximal simple paths
• Newbler: implements OLC twice; constructs unitigs; constructs contigs
• Velvet: algorithms for simplification and error removal
• ABySS: addresses memory limitations; simplification; doesn't build scaffolds
• SOAPdenovo: uses pre-set thresholds; bubble removal based on coverage; uses OLC and DBG techniques

27

Algorithms overview

28

Third Generation of Sequencing

SMS: Single Molecule Sequencing

Quiver Nanocorrect

29

Advances in Bioinformatics and technology

•Reducing computational space requirements

•More sensitive variant detections

•Advances in read length

•Single cell sequencing

•Optical mapping

•Metagenomics

•TruSeq

30

Genome sequencing

STEP 1

Sample preparation

STEP 2

Sequencing

STEP 3

Assembly

STEP 4

Annotation

Annotation

ORFs

GO

Variation RNA-Seq
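
Finding ORFs, the first annotation item listed here, can be illustrated with a simple scan. This is a minimal sketch that only checks the three forward reading frames and assumes the standard stop codons. The function name, the `min_codons` parameter and the toy sequence are hypothetical, and gene predictors use far richer models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for ATG...stop ORFs in the forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))
```

A fuller version would also scan the reverse complement so that genes on the opposite strand are not missed.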

31

Retrotransposons

Structural Functional

32

Genome browsers

A graphical interface for the display of genomic data

33

Genome browsers

Apollo

IGV

NCBI Genome Workbench

UCSC Genome Browser

Artemis

34

Peanut Genome

STEP 1

Sample preparation

STEP 2

Sequencing

STEP 3

Assembly

STEP 4

Annotation

Introduction

• Arachis hypogaea
• 46 million tons
• Endemic to South America
• Staple food

Genome characteristics

• Genome size: 2.7 Gb
• 40 chromosomes in the tetraploid
• Repetitive content: 64%
• No large change in genome size since polyploidy

Goal of article

• Sequence and annotate 2 candidate ancestors of peanut
• Sequence the peanut transcripts
• Find the real ancestors of peanut
• Propose the site of peanut domestication

Methods

• Plant samples
• Genome sequencing and assembly
• Genome annotation
• Transcript sequencing and assembly

Seeds from the Brazilian Arachis germplasm collection
Illumina HiSeq 2000
Paired-end libraries with 250 bp, 500 bp, 2 kb, 5 kb, 10 kb and 20 kb insert sizes
40 kb fosmid-based libraries
160X coverage in 90-150 bp reads
Quality filtering
COPE
SOAPdenovo 2.05
GapCloser
TruSeq
Linkage maps

Transposons: RepeatMasker
Gene prediction: MAKER-P
Gene duplications
DR genes: BLASTP
Gene evolution: DAGchainer
Synteny: MUMmer, CViT

FastQC
Trinity package
Transcript accuracy estimation: GSNAP
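
The 160X coverage figure in the methods relates read count, read length and genome size through coverage = (number of reads x read length) / genome size. The sketch below works through that arithmetic. The genome size and read length in the example are illustrative assumptions, not values taken from the article.

```python
def coverage(num_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_len / genome_size


def reads_needed(target_coverage: int, read_len: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_len)  # integer ceiling division


# Assumed example: a 1,000,000 bp genome sequenced with 100 bp reads
n = reads_needed(160, 100, 1_000_000)
print(n, coverage(n, 100, 1_000_000))
```

This is an average only. Actual depth varies along the genome, which is the coverage bias mentioned earlier in the talk.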

Results

• Sequencing and assembly
– 1211 and 1512 Mb for the A and B genomes
– 10 kb scaffolds
– Genetic maps to resolve chimeric scaffolds
– Molecular markers to resolve scaffold order
– 96 and 99.2% of the sequence in 10 chromosome pseudomolecules
– Use of 14 BACs

Results

• Transposons
– 61.7 and 68.5%
– Mostly shared
– Macroscale: similar
– Microscale: abundant differences
– LTR retrotransposons: half of each genome
– LINEs: highest in plants

Results

Results

• Gene annotation and duplications
– 36734 genes in A. duranensis
– 41840 genes in A. ipaensis
– More local duplications in the B genome

Results

• DNA methylation
– Whole-genome bisulfite sequencing
– MethylC-Seq
– 8X and 10X coverage
– Similar genic methylation patterns

Results

• DR genes
– 345 and 397 respectively
– Largest clusters in distal regions
– Root-knot nematode resistance genes
– Rust resistance genes

Results

• Gene evolution
– Ks parameter
– Ks: 0.95 and 0.90
– Divergence: 2.16 million years ago

Results

• Chromosomal structure and synteny
– Mostly symmetrical chromosomes
– 2, 3, 4, 10: collinear
– 5, 6, 9: large inversion in one arm
– 1: large inversion in both arms
– 7 & 8: complete rearrangement
– Distal-proximal switch

Results

Results

• Comparison with tetraploid peanut
– One-to-one correspondence
– 98.3 and 99.9% identity
– Genetic recombination in collinear homeologous chromosomes
– Tendency toward the A. ipaensis genome
– 247,000 and 9,400 years of divergence

Results

Results

• Tetraploid transcript assembly with diploid guide
– De novo
– Parsing into A and B followed by separate assembly
– Parsing into A and B followed by genome-guided assembly
– The last is the most accurate (68.5% mismatch-free mapping)

Discussion

• B subgenome nearly identical to A. ipaensis
• The site of occurrence of A. ipaensis
• The story of cultivated peanut

Thank You
