Assembly & Annotation at iPlant

iPlant: Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona), Michael Schatz (CSHL), Roger Barthelson (CSHL)

Cantarel et al. 2008. Genome Research 18:188
Holt & Yandell. 2011. BMC Bioinformatics 12:491

Maize Genome Project

• Genome
– 2500 Mb
– 10 chromosomes
– 50,000 genes

• Strategy
– BAC-by-BAC
– 17,000 clones
– Finish genic regions

• 9 PIs

GenBank

Mapping (U. Arizona)
– FPC map
– Min. tiling path
– BAC selection

Sequencing (Washington U.)
– 6X shotgun
– Auto finish
– Manual finishing

Annotation (CSHL)
– Repeat analysis
– Gene prediction
– Database
– Browser

3-yr NSF funded project -- $30 M

Maizesequence.org

Technology Lowering Barriers

Complexity of Genomes

Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534.

Assembling a Genome

1. Shear & Sequence DNA

2. Construct assembly graph from overlapping reads
…AGCCTAGGGATGCGCGACACGT
GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
CAACCTCGGACGGACCTCAGCGAA…

3. Simplify assembly graph

4. Detangle graph with long reads, mates, and other links
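Step 2 can be illustrated with a minimal sketch that finds suffix-prefix overlaps among the three example reads on this slide and merges them into one contig. The function names, the all-against-all comparison, and the `min_len=10` cutoff are illustrative assumptions; production assemblers use indexed data structures rather than pairwise comparison.

```python
from itertools import permutations


def overlap(a, b, min_len=10):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=10):
    """Map (i, j) -> overlap length for every ordered pair of overlapping reads."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap(reads[i], reads[j], min_len)
        if n:
            edges[(i, j)] = n
    return edges


def merge_path(reads, path, edges):
    """Merge reads along a path of graph nodes, dropping the overlapping prefix."""
    contig = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        contig += reads[nxt][edges[(prev, nxt)]:]
    return contig


# The three reads shown on the slide (without the leading/trailing ellipses).
reads = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]
edges = overlap_graph(reads)
contig = merge_path(reads, [0, 1, 2], edges)
print(edges)
print(contig)
```

Each adjacent pair of reads shares a 15 bp overlap, so the graph is a simple chain. Repeats longer than the reads would add extra edges, which is what steps 3 and 4 have to resolve.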

Ingredients for a good assembly

Current challenges in de novo plant genome sequencing and assembly

Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology. 13:243

Coverage

High coverage is required
– Oversample the genome to ensure every base is sequenced with long overlaps between reads
– Biased coverage will also fragment assembly

[Figure: Expected Contig Length vs. Read Coverage]
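The relationship between coverage and fragmentation can be sketched with the idealized Lander-Waterman model. This is a standard textbook model that assumes uniformly random reads and no repeats. It is not necessarily the model behind the figure, and the genome size, read length, and minimum overlap below are hypothetical values.

```python
import math


def coverage(num_reads, read_len, genome_size):
    """Average depth c = N * L / G."""
    return num_reads * read_len / genome_size


def fraction_uncovered(c):
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-c)


def expected_contigs(num_reads, read_len, genome_size, min_overlap):
    """Lander-Waterman expected number of contigs, N * exp(-c * sigma),
    where sigma = 1 - min_overlap / read_len."""
    c = coverage(num_reads, read_len, genome_size)
    sigma = 1 - min_overlap / read_len
    return num_reads * math.exp(-c * sigma)


genome = 10_000_000  # hypothetical 10 Mbp genome
for n_reads in (200_000, 1_000_000, 3_000_000):
    c = coverage(n_reads, 100, genome)
    print(f"{c:.0f}x  uncovered={fraction_uncovered(c):.2e}  "
          f"contigs~{expected_contigs(n_reads, 100, genome, 20):.0f}")
```

Under these assumptions the expected number of contigs drops sharply as coverage rises, which is why oversampling matters. Real data falls short of this ideal because of coverage bias and repeats.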

Read Length

Reads & mates must be longer than the repeats
– Short reads will have false overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes into contigs

Quality

Errors obscure overlaps
– Reads are assembled by finding k-mers shared between pairs of reads
– High error rate requires very short seeds, increasing complexity and forming assembly hairballs
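A small sketch of why errors force shorter seeds: a single substitution destroys every k-mer that spans it, and the chance that a k-mer is error-free falls as (1 - e)^k for a per-base error rate e. The two reads and the function names are hypothetical, chosen only for illustration.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """K-mers present in both reads; these act as seeds for overlap detection."""
    return kmers(a, k) & kmers(b, k)


def p_error_free(k, error_rate):
    """Probability that a k-mer contains no sequencing error,
    assuming independent per-base errors."""
    return (1 - error_rate) ** k


read_a = "ACGTTGCAAGGCTATCCGAT"
read_b = "ACGTTGCAAGTCTATCCGAT"  # read_a with one substitution (hypothetical data)

for k in (4, 8, 12):
    print(k, len(shared_kmers(read_a, read_b, k)))

for rate in (0.01, 0.15):
    print(rate, round(p_error_free(20, rate), 3))
```

With k = 12 the single mismatch removes every shared seed between these two 20 bp reads, while k = 4 still finds seeds. Short seeds also match in many unrelated places, which is the source of the "hairball" effect.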

N50 size

Def: 50% of the genome is in contigs as large as the N50 value

Example: 1 Mbp genome

N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp)

Note: N50 values are only meaningful to compare when base genome size is the same in all cases

[Figure: 1000 kbp genome; contigs sorted by size (kbp): 300, 100, 45, 45, 30, 20, 15, 15, 10, ...; the 50% mark falls within the 30 kbp contig]
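The N50 definition and the 1 Mbp example can be checked with a short sketch. The function name and the optional `genome_size` argument are illustrative; when no genome size is given, the sketch falls back to the total assembly length, which is a common convention but gives a different answer.

```python
def n50(contig_lengths, genome_size=None):
    """Smallest contig length such that contigs of at least that length
    cover 50% of `genome_size` (or of the total assembly if not given)."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # contigs never reach half of the reference size


# Contig sizes in kbp from the slide's example (the trailing "..." is omitted).
contigs_kbp = [300, 100, 45, 45, 30, 20, 15, 15, 10]

print(n50(contigs_kbp, genome_size=1000))  # against the 1 Mbp genome
print(n50(contigs_kbp))                    # against the assembly total only
```

The two calls return different values for the same contigs, which illustrates the note above: N50 values are comparable only when they are computed against the same base size.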

• Attempt to answer the question: “What makes a good assembly?”

• Organizers provided sequence data to assembly experts around the world
– Assemblathon 1: ~100 Mbp simulated genome
– Assemblathon 2: 3 vertebrate genomes, each ~1 Gbp

• Results demonstrate trade-offs assemblers must make

Assemblathon 1: A competitive assessment of de novo short read assembly methods.
Earl, DA, et al. (2011) Genome Research. doi: 10.1101/gr.126599.111

Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species.
Bradnam, KR, et al. (2013) GigaScience 2:10 doi:10.1186/2047-217X-2-10

Final Rankings

• ALLPATHS and SOAPdenovo came out neck-and-neck, followed closely by Celera Assembler, SGA, and ABySS

• My recommendation for “typical” short read assembly is to use ALLPATHS

• Single molecule sequencing is becoming extremely attractive if you have access

Apps in Discovery Environment

• Genome Assembly
– ALLPATHS-LG
– SOAPdenovo2
– ABySS
– Velvet
– Newbler
– Ray
– Contig analysis tools

• With or without reference sequence for comparison

Assembly Workflow

Upload Reads: Minutes to Months

Quality Assessment: Minutes to Hours

De novo Assembly: Hours to Days

Assembly Assessment: Minutes to Hours

Apps in Discovery Environment

• Sequence Quality Control
– FastQC
– Fastx Toolkit
– Suffixerator/Tallymer/mkindex
– Sabre, Scythe, Sickle (paired end trimming)
– SGA cleanup (paired end quality trimming)
– Future plans
  • Sequence induction, assessment, and trimming pipeline
  • Mira contaminant detection and removal (for sequencing studies)

QC: FastQC

QC: Read Coverage

[Figure: Reference and Reads tracks, with Errors, Coverage, and Repeats labeled]

Wheat Genome (A. tauschii / CSHL)

QC: Mer counts

[Workflow: Frag1.fq and Frag2.fq → FASTX_fastq-to-fasta (one per file) → Suffixerator → Suffixerator-Tallymer-mkindex]

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.
Kurtz S, Narechania A, Stein JC, Ware D. (2008) BMC Genomics. 9:517
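The idea behind mer counting can be sketched in a few lines: count how often each k-mer occurs across the reads, then summarize how many distinct k-mers occur once, twice, and so on. This is a plain in-memory dictionary count for illustration only. It is not how Tallymer works, since Tallymer builds on an enhanced suffix array, and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map occurrence count -> number of distinct k-mers with that count."""
    return Counter(counts.values())


reads = ["ACGTACGT", "CGTACGA"]  # toy reads, not real data
counts = count_kmers(reads, 4)
hist = frequency_histogram(counts)
print(counts.most_common())
print(sorted(hist.items()))
```

In real data, k-mers seen only once are often sequencing errors, a peak near the sequencing depth corresponds to unique genome sequence, and a long high-count tail points to repeats.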

Using Allpaths LG

• You must have at least 2 libraries
– One overlapping fragment library, e.g. 100 bp reads with 180 bp spacing
– One jumping mate-pair library, e.g. 3000 bp spacing
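The "overlapping" requirement for the fragment library is simple arithmetic: two reads of a pair overlap when their combined length exceeds the fragment size. The sketch below checks this for the example numbers above. The function names are hypothetical, and the sketch reads "spacing" as the full fragment size.

```python
def pair_overlap(read_len, fragment_size):
    """Bases shared by the two reads of a pair; negative means a gap between them."""
    return 2 * read_len - fragment_size


def is_overlapping_library(read_len, fragment_size):
    """True when the two reads of each pair overlap."""
    return pair_overlap(read_len, fragment_size) > 0


print(pair_overlap(100, 180))             # fragment library example
print(is_overlapping_library(100, 180))
print(is_overlapping_library(100, 3000))  # jumping library example
```

With 100 bp reads and 180 bp fragments each pair overlaps by 20 bp, so the pair can be merged into one longer read. The 3000 bp jumping pairs leave a large gap and instead provide long-range linking information.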

How ALLPATHS-LG works

[Figure: reads → corrected reads → doubled reads → unipaths → localized data → local graph assemblies → global graph assembly → assembly]

Sante Gnerre et al (2010) PNAS 1513–1518, doi: 10.1073/pnas.1017351108

See YouTube: https://www.youtube.com/watch?v=USlTWhmw0oQ&index=3&list=PL-0S9LiUi0viEhYTP_EQtKpYkcYAVW6IH

Where is the sample data?

Data Source: GAGE Project

ALLPATHS-LG in DE

180 bp

3500 bp

http://gage.cbcb.umd.edu/data/index.html

ALLPATHS-LG in DE: Where is the Allpaths LG App?

ALLPATHS-LG in DE: Fragment Reads

ALLPATHS-LG in DE: Jumping Reads

ALLPATHS-LG in DE: Run Settings

Running ALLPATHS-LG

Post-QC: CEGMA

CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes

Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.

Resources

• iPlant
– http://www.iplantcollaborative.org/

• Assembly Competitions
– Assemblathon: http://assemblathon.org/
– GAGE: http://gage.cbcb.umd.edu/

• Assembler Websites:
– ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
– Celera Assembler: http://wgs-assembler.sf.net

• Tools:
– FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– Tallymer: http://www.zbh.uni-hamburg.de/?id=211
– CEGMA: http://korflab.ucdavis.edu/datasets/cegma/


What Are Annotations?

• Annotations are descriptions of features of the genome
  • Structural: exons, introns, UTRs, splice forms etc.
  • Coding & non-coding genes
  • Expression, repeats, transposons

• Annotations should include evidence trail
  • Assists in quality control of genome annotations

• Examples of evidence supporting a structural annotation:
  • Ab initio gene predictions
  • ESTs
  • Protein homology

Secondary Annotation

• Protein Domains
  • InterProScan: combines many HMM databases

• GO and other ontologies

• Pathway mapping
  • E.g. BioCyc Pathway tools

Challenges in Plant Genome Annotation

• Genomes are BIG
• Highly repetitive
• Many pseudogenes
• Assembly contamination
• Incomplete evidence
• No method is 100% accurate

Options for Protein-coding Gene Annotation

Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174

Typical Annotation Pipeline

• Contamination screening
• Repeat/TE masking
• Ab initio prediction
• Evidence alignment (cDNA, EST, RNA-seq, protein)
• Evidence-driven prediction
• Chooser/combiner
• Evaluation/filtering
• Manual curation

MAKER-P Automated Pipeline

Ab initio prediction Evidence

MPI-enabled to allow parallel operation on large compute clusters

Collaboration with Yandell Lab

Repeat Library

What is a GFF File?

Generic Feature Format
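A GFF3 record is one tab-separated line with nine columns: seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below parses a single line into a dictionary. The example record is hypothetical, and the parser is a minimal illustration rather than a full implementation of the GFF3 specification.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    attrs = {}
    for item in record["attributes"].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = unquote(value)  # values may be URL-escaped
    record["attributes"] = attrs
    return record


# Hypothetical example record, not taken from a real annotation.
example = "chr1\tmaker\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=example%20gene"
gene = parse_gff3_line(example)
print(gene["type"], gene["start"], gene["end"], gene["attributes"])
```

Parent/child relationships between genes, mRNAs, and exons are carried in the `ID` and `Parent` attributes, so a fuller reader would group records by those keys.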

Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).

[Figure axis: quality, from better (low AED) to worse (high AED)]
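AED is commonly defined at the nucleotide level as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases supported by evidence. A value of 0 means perfect agreement and 1 means no support. The sketch below computes it from interval lists. The function names and example intervals are hypothetical, and MAKER's own implementation may differ in detail.

```python
def covered(intervals):
    """Set of positions covered by 1-based, inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(annotation, evidence):
    """Annotation Edit Distance between an annotation and its aligned evidence."""
    a, e = covered(annotation), covered(evidence)
    if not a or not e:
        return 1.0  # nothing to compare: treat as unsupported
    shared = len(a & e)
    sensitivity = shared / len(e)  # evidence bases recovered by the annotation
    specificity = shared / len(a)  # annotation bases backed by evidence
    return 1 - (sensitivity + specificity) / 2


exons = [(1, 100), (201, 300)]          # hypothetical annotated exons
print(aed(exons, exons))                # identical
print(aed([(1, 100)], [(51, 150)]))     # half overlapping
print(aed([(1, 100)], [(201, 300)]))    # disjoint
```

Position sets keep the sketch short. For whole-genome use, interval arithmetic would be more memory-efficient.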

PAG 2014:

• W559 - Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)
  • 20.15 Gb assembly, split into 40 jobs, 216 CPU/job (8640 CPU total), 17 hours

• P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species
  • 10 rice species (each w/ 12 chromosome pseudomolecules)
  • 96 CPU per chromosome (1152 CPU total), ~2 hr per genome

TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

Genome                 Assembly    Size (Mb)   CPU    Run Time
Arabidopsis thaliana   TAIR10      120         600    2:44
Arabidopsis thaliana   TAIR10      120         1500   1:27
Zea mays               RefGen_v2   2067        2172   2:53

Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144
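The two Arabidopsis rows of the Lonestar run-time table give a rough view of how well the MPI runs scale. The sketch below computes speedup and parallel efficiency from them, assuming the run times are in hours:minutes (the slide does not state the unit). It also totals CPU-hours for the loblolly pine run, taking the 17 hours as wall-clock time across all 40 jobs, which is likewise an assumption.

```python
def to_minutes(hmm):
    """Convert an 'H:MM' string to minutes (assumed format of the run-time column)."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def parallel_efficiency(cpus_a, time_a, cpus_b, time_b):
    """Observed speedup from run A to run B, divided by the increase in CPUs."""
    speedup = to_minutes(time_a) / to_minutes(time_b)
    return speedup / (cpus_b / cpus_a)


# Arabidopsis thaliana TAIR10: 600 CPUs in 2:44 vs 1500 CPUs in 1:27.
efficiency = parallel_efficiency(600, "2:44", 1500, "1:27")
print(f"efficiency going from 600 to 1500 CPUs: {efficiency:.2f}")

# Loblolly pine: 40 jobs x 216 CPUs for 17 hours.
pine_cpu_hours = 40 * 216 * 17
print(f"loblolly pine: {pine_cpu_hours} CPU-hours")
```

Under these assumptions, 2.5 times the CPUs gives a speedup of roughly 1.9, so efficiency falls as CPUs are added to a genome of this size.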

MAKER-P at iPlant

• Virtual image

• MPI-enabled for parallel computing

• Check out with up to 16 CPU

• Tested with 4 CPU instance

• Completed rice chr 1 in 8 hr 45 min

Atmosphere: MAKER_2.28 (emi-F13821D0)

MAKER-P Tutorial

https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

Annotation Post-Analysis

• AED threshold
• InterProScan
• Comparative analysis, e.g. BLAST vs RefSeq proteins
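An AED threshold can be applied as a simple filter over the annotation GFF3. The sketch below assumes MAKER-style output in which each mRNA feature carries an `_AED` attribute in column 9. Both the attribute name and the 0.5 cutoff are assumptions for illustration, so check them against your own output before relying on this.

```python
def passing_mrna_ids(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose `_AED` attribute is <= max_aed.

    Assumes GFF3 lines with an `_AED=<float>` entry in the attributes column;
    features without that attribute are skipped.
    """
    keep = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        attrs = dict(item.split("=", 1)
                     for item in fields[8].split(";") if "=" in item)
        aed = attrs.get("_AED")
        if aed is not None and float(aed) <= max_aed:
            keep.append(attrs.get("ID"))
    return keep


# Hypothetical records for illustration only.
lines = [
    "##gff-version 3",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1;_AED=0.10",
    "chr1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=mRNA2;Parent=gene2;_AED=0.80",
]
kept = passing_mrna_ids(lines)
print(kept)
```

A stricter cutoff keeps only well-supported models, and a looser one retains more candidates for follow-up with InterProScan or BLAST.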

Annotation Post-Analysis

InterProScan

Additional MAKER-P Resources

• MAKER-P: http://www.yandell-lab.org/software/maker-p.html

• Repeat Library construction: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Advanced

• Pseudogene identification: http://shiulab.plantbiology.msu.edu/wiki/index.php/Protocol:Pseudogene
