Assembly & Annotation at iPlant
iPlant:Josh Stein (CSHL)Matt Vaughn (TACC)Dian Jiao (TACC)Zhenyuan Lu (CSHL)Nirav Merchant (U. Arizona)Michael Schatz (CSHL)Roger Barthelson (CSHL)
Cantarel et al. 2008. Genome Research 18:188Holt & Yandell. 2011. BMC Bioinformatics 12:491
Maize Genome Project
• Genome– 2500 Mb– 10 chromosomes– 50,000 genes
• Strategy– BAC-by-BAC– 17,000 clones– Finish genic regions
• 9 PI’s
GenBank
Mapping
FPC map
Min. tiling path
BAC selection
U. Arizona
6X shotgun
Sequencing
Auto finish
Manual finishing
Washington U.
c
Annotation
Repeat analysis
Gene prediction
DatabaseBrowser
CSHL
3-yr NSF funded project -- $30 M
Maizesequence.org
3
Technology Lowering Barriers
Assembly & Annotation at iPlant
Complexity of Genomes
Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534.
Assembling a Genome
3. Simplify assembly graph
1. Shear & Sequence DNA
4. Detangle graph with long reads, mates, and other links
2. Construct assembly graph from overlapping reads…AGCCTAGGGATGCGCGACACGT GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
CAACCTCGGACGGACCTCAGCGAA…
Ingredients for a good assembly
Current challenges in de novo plant genome sequencing and assembly
Schatz MC, Witkowski, McCombie, WR (2012) Genome Biology. 12:243
Coverage
High coverage is required– Oversample the genome to
ensure every base is sequenced with long overlaps
between reads– Biased coverage will also
fragment assembly
Read Coverage
Exp
ecte
d C
on
tig
Len
gth
Read Length
Reads & mates must be longer than the repeats– Short reads will have false
overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes
into contigs
Quality
Errors obscure overlaps– Reads are assembled by finding kmers shared in pair of
reads– High error rate requires very
short seeds, increasing complexity and forming
assembly hairballs
N50 sizeDef: 50% of the genome is in contigs as large as the N50 value
Example: 1 Mbp genome
N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp)
Note:N50 values are only meaningful to compare when base genome size is the same in all cases
1000
300 45 30100 20 15 1510 . . . . .45
50%
• Attempt to answer the question:“What makes a good assembly?”
• Organizers provided sequence data to assembly experts around the world– Assemblathon 1: ~100Mbp simulated genome– Assemblathon 2: 3 vertebrate genomes each ~1GB
• Results demonstrate trade-offs assemblers must make
Assemblathon 1: A competitive assessment of de novo short read assembly methods.Earl, DA, et al. (2011) Genome Research. doi: 10.1101/gr.126599.111
Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate speciesBradnam, KR. et al (2013) GigaScience 2:10 doi:10.1186/2047-217X-2-10
Final Rankings
• ALLPATHS and SOAPdenovo came out neck-and-neck followed closely behind by Celera Assembler, SGA, and ABySS
• My recommendation for “typical” short read assembly is to use ALLPATHS
• Single molecule sequencing becoming extremely attractive if you have access
Apps in Discovery Environment
• Genome Assembly– Allpaths-LG– Soapdenovo2– ABySS– Velvet– Newbler– Ray– Contig analysis tools
• With or without reference sequence for comparison
Assembly Workflow
Upload ReadsMinutes to Months
Quality Assessment
Minutes to Hours
De novo Assembly
Hours to Days
Assembly Assessment
Minutes to Hours
Apps in Discovery Environment
• Sequence Quality Control– FastQC– Fastx Toolkit– Suffixerator/Tallymer/mkindex– Sabre, Scythe, Sickle (paired end
trimming)– SGA cleanup (paired end quality
trimming)– Future plans
• Sequence induction, assessment, and trimming pipeline
• Mira contaminant detection and removal
(for sequencing studies)
QC: FastQC
QC: Read Coverage
Reference:
Reads:
ErrorsCoverage
Repeats
Wheat Genome(A. tauschi / CSHL)
QC: Mer counts
Frag1.fq Frag2.fq
FASTX_fastq-to-fasta
FASTX_fastq-to-fasta
Suffixerator
Suffixerator-Tallymer-mkindex
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomesKurtz S. Narechania A, Stein JC, Ware D. (2008) BMC Genomics. 9:517
Using Allpaths LG
• You must have at least 2 libraries– One overlapping fragment library, e.g. 100 bp
reads with 180 bp spacing– One jumping mate-pair library, e.g. 3000 bp
spacing
How ALLPATHS-LG works
assembly
reads
unipaths
corrected reads
doubled reads
localized data
local graph assemblies
global graph assembly
Sante Gnerre et al (2010) PNAS 1513–1518, doi: 10.1073/pnas.1017351108
See Youtube:https://www.youtube.com/watch?v=USlTWhmw0oQ&index=3&list=PL-0S9LiUi0viEhYTP_EQtKpYkcYAVW6IH
Where is the sample data?
Data Source: GAGE Project
ALLPATHS-LG in DE
180 bp
3500 bp
Where is the Allpaths LG App?ALLPATHS-LG in DE
Fragment ReadsALLPATHS-LG in DE
Jumping ReadsALLPATHS-LG in DE
Run SettingsALLPATHS-LG in DE
Running ALLPATHS-LG
Post-QC: CEGMA
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.
Resources• iPlant
– http://www.iplantcollaborative.org/
• Assembly Competitions– Assemblathon: http://assemblathon.org/– GAGE: http://gage.cbcb.umd.edu/
• Assembler Websites:– ALLPATHS-LG:
http://www.broadinstitute.org/software/allpaths-lg/blog/– SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html– Celera Assembler: http://wgs-assembler.sf.net
• Tools:– FastQC:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/– Tallymer: http://www.zbh.uni-hamburg.de/?id=211– CEGMA: http://korflab.ucdavis.edu/datasets/cegma/
What Are Annotations?
• Annotations are descriptions of features of the genome• Structural: exons, introns, UTRs, splice forms etc.
• Coding & non-coding genes
• Expression, repeats, transposons
• Annotations should include evidence trail• Assists in quality control of genome annotations
• Examples of evidence supporting a structural annotation:• Ab initio gene predictions
• ESTs
• Protein homology
Secondary Annotation
• Protein Domains• InterPro Scan: combines many HMM databases
• GO and other ontologies• Pathway mapping
• E.g. BioCyc Pathway tools
Challenges in Plant Genome Annotation• Genomes are BIG • Highly repetitive• Many pseudogenes• Assembly contamination• Incomplete evidence• No method is 100% accurate
Options for Protein-coding Gene Annotation
Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174
Typical Annotation Pipeline
• Contamination screening• Repeat/TE masking• Ab initio prediction• Evidence alignment (cDNA, EST, RNA-seq,
protein)• Evidence-driven prediction• Chooser/combiner• Evaluation/filtering• Manual curation
MAKER-P Automated Pipeline
Ab initio prediction Evidence
MPI-enabled to allow parallel operation on large compute clusters
Collaboration with Yandell Lab
Repeat Library
What is a GFF File?
Generic Feature Format
Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
Better Quality Worse
• W559 - Annotation of the Lobolly Pine Megagenome—Jill Wegrzyn• 20.15 Gb assembly—split into 40 jobs—216 CPU/job (8640 CPU total)—17 hours
• P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species• 10 rice species (each w/12 chromosome pseudomolecules)
• 96 CPU per chromosome (1152 CPU total) ~ 2hr per genome
36
22,656 CPU cores on1,888 nodes
Genome AssemblySize (Mb) CPU
Run Time
Arabidopsis thaliana TAIR10 120 600 2:44Arabidopsis thaliana TAIR10 120 1500 1:27Zea mays RefGen_v2 2067 2172 2:53
TACC Lonestar Supercomputer
Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144
PAG 2014:
MAKER-P at iPlant
MAKER-P at iPlant
• Virtual image
• MPI-enabled for parallel computing
• Check out with up to 16 CPU
• Tested with 4 CPU instance
• Completed rice chr 1 in 8 hr 45 min
37
Atmosphere: MAKER_2.28 (emi-F13821D0)
MAKER-P Tutorial
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial
Annotation Post-Analysis
• AED threshold• InterProScan• Comparative analysis, e.g. BLAST vs RefSeq
proteins
Annotation Post-Analysis
InterProScan
Assembly & Annotation at iPlant
Additional MAKER-P Resources
• MAKER-P: http://www.yandell-lab.org/software/maker-p.html
• Repeat Library contstuction: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Advanced
• Pseudogene identification: http://shiulab.plantbiology.msu.edu/wiki/index.php/Protocol:Pseudogene
Top Related