Climbing Mt. Metagenome

Scaling Mt. Metagenome: Assembling very large data sets. C. Titus Brown, Assistant Professor, Computer Science and Engineering / Microbiology and Molecular Genetics, Michigan State University

Description

Assembling very large metagenomes from Illumina short reads.

Transcript of Climbing Mt. Metagenome

  • 1. Scaling Mt. Metagenome: Assembling very large data sets
    C. Titus Brown
    Assistant Professor
    Computer Science and Engineering /
    Microbiology and Molecular Genetics
    Michigan State University

2. Thanks for coming!
Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project.
Jim Tiedje will talk about the project as a whole at the JGI Users Meeting.
3. The basic problem.
Lots of metagenomic sequence data
(200 GB Illumina for < $20k?)
Assembly, especially metagenome assembly, scales poorly (due to high diversity).
Standard assembly techniques don't work well with sequences drawn from genomes at multiple abundances.
Many people don't have the computational resources to assemble at all (~1 TB of RAM or more).
4. We can't just throw more hardware at the problem, either.
Lincoln Stein
5. Jumping to the end:
We have implemented a solution for these problems:
Scalability of assembly,
Lack of resources,
and parameter choice.
We demonstrate this solution for a high diversity sample (219.1 Gb of Iowa corn field soil metagenome).
There is an additional surprise or two, so you should stick around!
6. Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
7. K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
8. K-mer graphs - branching
For decisions about which paths to take, etc., biology-based heuristics come into play as well.
9. Too much data: what can we do?
Reduce the size of the data (either with an approximate or an exact approach)
Divide & conquer: subdivide the problem.
For exact data reduction or subdivision, we need to grok the entire assembly graph structure.
But that is why assembly scales poorly in the first place.
10.-13. Abundance filtering
Approach used in two published Illumina metagenomic papers (MetaHIT/human microbiome and rumen papers)
Remove or trim reads with low-abundance k-mers
Either due to errors, or low-abundance organisms.
Inexact data reduction: may or may not remove usable data.
Works well for high-coverage data sets (rumen: est. 56x!)
However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
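As a concrete illustration of the filtering rule above, here is a minimal sketch using exact k-mer counting. Note this is illustrative only: khmer counts k-mers probabilistically in constant memory, and the function names here are made up for the example.

```python
from collections import Counter

def kmers(seq, k):
    """Yield all k-length substrings of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def abundance_filter(reads, k, min_abund):
    """Drop reads containing any k-mer seen fewer than min_abund times.

    Exact counting for illustration; khmer approximates this in
    constant memory, which is inexact but scales to large data.
    """
    counts = Counter(km for r in reads for km in kmers(r, k))
    return [r for r in reads
            if all(counts[km] >= min_abund for km in kmers(r, k))]
```

A read containing a sequencing error carries k-mers seen only once, so it is dropped; but so is a read from a genuinely rare organism, which is exactly the problem for low-coverage, high-diversity samples.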
14. Abundance filtering
15. Two exact data reduction techniques:
Eliminate reads that do not connect to many other reads.
Group reads by connectivity into different partitions of the entire graph.
For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
16. Eliminating unconnected reads
Graphsize filtering
17. Subdividing reads by connection
Partitioning
18. Two exact data reduction techniques:
Eliminate reads that do not connect to many other reads (graphsize filtering).
Group reads by connectivity into different partitions of the entire graph (partitioning).
For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
19. Engineering overview
Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure;
With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k.
Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes).
For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
20. Store graph nodes in Bloom filter
Graph traversal is done in full k-mer space;
Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
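The structure described above can be sketched as a fixed-size bit table indexed by a hash, with collisions simply ignored. This is an illustrative toy (class and method names are invented), not khmer's actual implementation:

```python
class KmerPresenceTable:
    """Hash table without collision tracking: one bit per slot.

    'No' answers are always correct; 'yes' answers are wrong with
    probability roughly equal to the table's fractional occupancy.
    """
    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.bits = bytearray((n_slots + 7) // 8)  # packed bit array

    def _slot(self, kmer):
        return hash(kmer) % self.n_slots

    def add(self, kmer):
        s = self._slot(kmer)
        self.bits[s // 8] |= 1 << (s % 8)

    def __contains__(self, kmer):
        s = self._slot(kmer)
        return bool(self.bits[s // 8] & (1 << (s % 8)))
```

One bit per slot is what makes the ~1-2 bytes per unique k-mer figure possible: memory depends on the number of slots, not on k.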
21. Practical application
Enables:
graph trimming (exact removal)
partitioning (exact subdivision)
abundance filtering
all for arbitrary K.
22.-24. Does it work?
Took 35m reads; assembled partitions separately (partitioned @ k=32, assembled @ k=33).

                     N contigs   Total bp   Largest contig
Unfiltered (35m)     130         223,341    61,766
Sum of partitions    130         223,341    61,766

YES.
25. Data reduction for assembly / practical details
Reduction performed on a machine with 16 GB of RAM.
Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 GB to 2 GB;
- Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
- Biggest group is 300k reads
- Memory required reduced from 40 GB to 500 MB;
- Time reduced from 4 hrs to < 5 minutes/group.
26. Does it work on bigger data sets?
35 m read data set partition sizes:
P1: 277,043 reads
P2: 5776 reads
P3: 4444 reads
P4: 3513 reads
P5: 2528 reads
P6: 2397 reads

Iowa continuous corn GA2 partitions (218.5 m reads):
P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads

27. Problem: big data sets have one big partition!?
Too big to handle on EC2.
Assembles with low coverage.
Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage
As we sequence more deeply, the lump becomes a bigger percentage of the reads => trouble!
Both for our approach,
And possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size)
28. Why this lump?
Real biological connectivity (rRNA, conserved genes, etc.)
Bug in our software
Sequencing artifact or error
29. Why this lump?
Real biological connectivity? Probably not.
- Increasing K from 32 to ~64 didn't break up the lump: not biological.
Bug in our software? Probably not.

  • We have a second, completely separate approach & implementation that confirmed the lump (bleu, by Rosangela Canino-Koning)

Sequencing artifact or error? YES.
- (Note: we do filter & quality-trim all sequences already.)
30. Good vs bad assembly graph
Low density
High density
31. Non-biological levels of local graph connectivity:
32. Higher local graph density correlates with position in read
33. Higher local graph density correlates with position in read
ARTIFACT
34. Trimming reads
Trim at high 'sodd' (sum of degree distribution):
From each k-mer in each read, walk two k-mers in all directions in the graph;
If more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence.
Overly stringent; actually trimming the (k-1) connectivity graph by degree.
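The trimming rule above can be sketched as follows, assuming the graph is available as a plain adjacency mapping (khmer's real implementation walks the Bloom filter representation instead, and the function names here are illustrative):

```python
def count_at_distance_two(graph, kmer):
    """Count distinct k-mers reachable in exactly two steps."""
    two_away = set()
    for n1 in graph.get(kmer, ()):
        for n2 in graph.get(n1, ()):
            if n2 != kmer:
                two_away.add(n2)
    return len(two_away)

def density_trim(read_kmers, graph, max_density=3):
    """Trim a read at the first k-mer whose local graph density
    (distinct k-mers at exactly two steps) exceeds max_density."""
    for i, km in enumerate(read_kmers):
        if count_at_distance_two(graph, km) > max_density:
            return read_kmers[:i]
    return read_kmers
```

In a linear (low-density) region each k-mer sees only one or two k-mers two steps away, so nothing is trimmed; at a knot-like hub the count jumps and the rest of the read is discarded.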
35. Trimmed read examples
>895:5:1:1986:16019/2
TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT
CGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1
TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC
GCGATG
>895:5:1:1995:6913/2
GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT
GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
36. Preferential attachment due to bias
Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
These artifacts will then connect that group of reads to all other groups possessing artifacts;
and all high-coverage contigs will amalgamate into a single graph.
37. Artifacts from sequencing falsely connect graphs
38. Preferential attachment due to bias
Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
These artifacts will then connect that group of reads to all other groups possessing artifacts;
and all high-coverage contigs will amalgamate into a single graph.
39. Groxel view of knot-like region / Arend Hintze
40. Density trimming breaks up the lump:
Old P1, sodd-trimmed
(204.6 m reads -> 179 m):
P1: 23,444,332 reads
P2: 60,703 reads
P3: 48,818 reads
P4: 39,755 reads
P5: 34,902 reads
P6: 33,284 reads

Untrimmed partitioning (218.5 m reads):
P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads

41. What does density trimming do to assembly?
204 m reads in lump:
assembles into 52,610 contigs;
total 73.5 MB
180 m reads in trimmed lump:
assembles into 57,135 contigs;
total 83.6 MB
(all contigs > 1kb)
Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0
42. Wait, what?
Yes, trimming these knot-like sequences improves the overall assembly!
We remove 25.6 m reads and gain 10.1 MB!?
The trend is the same for ABySS, another k-mer graph assembler.
43. Is this a valid assembly?
Paired-end usage is good.
50% of contigs have BLASTX hit better than 1e-20 in Swissprot;
75% of contigs have BLASTX hit better than 1e-20 in TrEMBL;
Reference genomes sequenced by JGI:
Frateuria aurantia: 1376 hits > 100 aa
Saprospira grandis: 1114 hits > 100 aa
(> 50% identity over > 50% of gene)
44. So what's going on?
Current assemblers are bad at dealing with certain graph structures (knots).
If we can untangle knots for them, that's good, maybe?
Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves?
Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
45. OK, let's assemble!
Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
148,053 contigs,
in 220 MB;
max length 20,322
max coverage ~10x
all done on Amazon EC2, ~ 1 week for under $500.
Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0
46. Full Iowa corn / mapping stats
1,806,800,000 QC/trimmed reads (1.8 bn)
204,900,000 reads map to some contig (11%)
37,244,000 reads map to contigs > 1kb (2.1%)
> 1 kb contig is a stringent criterion!
Compare:
80% of MetaHIT reads to > 500 bp;
65%+ of rumen reads to > 1kb
47. Percentage mapped vs. contig size
48. High coverage partitions assemble more reads
49. Success, tentatively.
We are still evaluating assembly and assembly parameters; should be possible to improve in every way.
(~10 hrs to redo entire assembly, once partitioned.)
The main engineering point is that we can actually run this entire pipeline on a relatively small machine
(8 core/68 GB RAM)
We can do dozens of these in parallel on Amazon rental hardware.
And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
50. Optimizing per-partition assembly
Metagenomes contain mixed-abundance genomes.
Current assemblers are not built for mixed-abundance samples (problem with mRNAseq, too).
Repeat resolution
Error/edge trimming
Since we're breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
51. Mixing parameters improves assembly statistics
Objective function: maximize sum(contigs > 1kb)
4.5x average coverage: gained 228 contigs / 469 kb
(over 152 contigs / 215 kb)
5.8x average coverage: gained 78 contigs / 304 kb
(over 248 contigs / 708 kb)
8.2x average coverage: lost 58 contigs / gained 116 kb
(over 279 contigs / 803 kb)
52. Conclusions
Engineering: can assemble large data sets.
Scaling: can assemble on rented machines.
Science: can optimize assembly for individual partitions.
Science: low-abundance reads are retained.
53. Caveats
Quality of assembly??
Illumina sequencing bias/error issue needs to be explored.
Regardless of the Illumina-specific issue, it's good to have tools/approaches to look at the structure of large graphs.
Need to better analyze upper limits of data structures.
Have not applied our approaches to high-coverage data yet; in progress.
54. Future thoughts
Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn't reduce the problem significantly.
Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future.
This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence.
Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, ...)
55. Acknowledgements
The k-mer gang:
Adina Howe
Jason Pell
Rosangela Canino-Koning
Qingpeng Zhang
Arend Hintze
Collaborators:
Jim Tiedje (Il padrino)
Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)
Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
56.-57. A guide to khmer
Python wrapping C++; BSD license.
Tools for:
K-mer abundance filtering (constant mem; inexact)
Assembly graph size filtering (constant mem; exact)
Assembly graph partitioning (exact)
Error trimming (constant mem; inexact)
Still in alpha form; largely undocumented.
58. k-mer coverage by partition
59. Abundance filtering affects low-coverage contigs dramatically
60. Many read pairs map together
61. Bonus slides
How much more do we need to sequence, anyway??
62. Calculating expected k-mer numbers
Entire population
S1
S2
Note: no simple way to correct abundance bias, so we don't, yet.
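The idea behind these expected k-mer calculations is a mark/recapture (Lincoln-Petersen) estimate over two subsamples S1 and S2. A generic sketch of the estimator, not the code actually used:

```python
def estimate_total_kmers(s1, s2):
    """Lincoln-Petersen mark/recapture estimate of the total number
    of distinct k-mers in a population, given the k-mer sets of two
    independent subsamples:

        N_hat = |S1| * |S2| / |S1 & S2|
    """
    s1, s2 = set(s1), set(s2)
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no shared k-mers; cannot estimate")
    return len(s1) * len(s2) / overlap
```

If each subsample catches half the population, their overlap is about a quarter of it, and the estimator recovers the full population size.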
63. Coverage estimates
(Based on k-mer mark/recapture analysis.)
Iowa prairie (136 GB): est. 1.26x
Iowa corn (62 GB): est. 0.86x
Wisconsin corn (190 GB): est. 2.17x
For comparison, the panda genome assembly
used ~50x with short reads.
Qingpeng Zhang
64. Coverage estimates: getting to 50x
Human -> 150 GB for 50x
Iowa prairie (136 GB): est. 1.26x -> 5.4 TB for 50x
Iowa corn (62 GB): est. 0.86x -> 3.6 TB for 50x
Wisconsin corn (190 GB): est. 2.17x -> 4.4 TB for 50x
Note that it's not clear what "coverage" exactly means in this case, since 16S-estimated diversity is very high.
65. What does coverage mean here?
Unseen sequence:
1x ~ 37%
2x ~ 14%
5x ~ 0.7%
10x ~ 0.005%
50x ~ 2e-20%
For metagenomes, coverage is of abundance weighted DNA.
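These percentages follow from a Poisson model of coverage: at average coverage c, the expected fraction of sequence with zero reads is e^(-c) (the standard Lander-Waterman-style assumption):

```python
import math

def unseen_fraction(coverage):
    """Expected fraction of sequence covered by zero reads, assuming
    read starts are Poisson-distributed (Lander-Waterman model)."""
    return math.exp(-coverage)
```

For example, unseen_fraction(1) is about 0.37 (the 37% above), unseen_fraction(5) about 0.0067 (0.7%), and unseen_fraction(50) about 2e-22, matching the 2e-20% figure.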
66. CAMERA annotation of full-set contigs (> 1000 bp)
# of ORFs: 344,661 (MetaGene)
Longest ORF: 1,974 bp
Shortest ORF: 20 bp
Average ORF: 173 bp
# of COG hits: 153,138 (e-value < 0.001)
# of Pfam hits: 170,072
# of TIGRFAM hits: 315,776
67. CAMERA COG Summary
68. The k-mer oracle
Q: is this k-mer present in the data set?
A: no => then it is not.
A: yes => it may or may not be present.
This lets us store k-mers efficiently.
69. Building on the k-mer oracle:
Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
70. The k-mer graph oracle
Q: does this k-mer overlap with this other k-mer?
A: no => then it does not, guaranteed.
A: yes => it may or may not.
This lets us traverse de Bruijn graphs efficiently.
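One consequence of this oracle is that adjacency never needs to be stored explicitly: a k-mer has at most eight candidate neighbors, each sharing a (k-1)-base overlap, so traversal just queries the presence oracle for each candidate. A sketch, where `present` is a stand-in for the Bloom filter query:

```python
def neighbors(kmer, present):
    """Enumerate de Bruijn graph neighbors of a k-mer by querying
    the presence oracle for all 8 possible one-base extensions.

    present: callable returning True if a k-mer is (probably) in
    the data set; False answers are guaranteed correct.
    """
    out = []
    for base in "ACGT":
        right = kmer[1:] + base   # shift left, append base
        left = base + kmer[:-1]   # prepend base, drop last base
        for cand in (right, left):
            if cand != kmer and present(cand):
                out.append(cand)
    return out
```

Because "no" answers are exact, a neighbor reported absent really is absent; false positives can only add spurious edges, never remove real ones.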
71. The contig size oracle
Q: could this read contribute to a contig bigger than N?
A: no => then it does not, guaranteed.
A: yes => then it might.
This lets us eliminate reads that do not belong to big contigs.
72. The read partition oracle
Q: does this read connect to this other read in any way?
A: no => then it does not, guaranteed.
A: yes => then it might.
This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
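At small scale, this partitioning can be illustrated with a union-find over reads that share k-mers. Note khmer partitions by traversing the k-mer graph in constant memory rather than by pairwise comparison, so this is only a toy model of the same idea:

```python
def partition_reads(reads, k):
    """Group read indices into partitions: reads sharing any k-mer,
    directly or transitively, land in the same partition."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    kmer_owner = {}  # first read index seen for each k-mer
    for i, read in enumerate(reads):
        parent[i] = i
        for j in range(len(read) - k + 1):
            km = read[j:j + k]
            if km in kmer_owner:
                union(i, kmer_owner[km])
            else:
                kmer_owner[km] = i

    groups = {}
    for i in range(len(reads)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each resulting group is a disconnected sub-problem that can be assembled independently, which is what makes the per-partition memory and time numbers above possible.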
73. Oracular fact
All of these oracles are cheap, can yield answers from a different probability distribution, and can be chained together (so you can keep asking oracles for as long as you want, and get more and more accurate answers).
74. Implementing a basic k-mer oracle
Conveniently, perhaps the simplest data structure in computer science is what we need:
a hash table that ignores collisions.
Note: P(false positive) = fractional occupancy.
75. A more reliable k-mer oracle
Use a Bloom filter approach: multiple oracles, in serial, are multiplicatively more reliable.
76. Scaling the k-mer oracle
Adding additional filters increases discrimination at the cost of speed.
This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
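That tradeoff can be made concrete with a little arithmetic: with a fixed memory budget split across f independent tables, each table is more heavily occupied, but a false positive requires a collision in every table. An illustrative model (the optimum for real Bloom filters differs slightly, and the function name is invented for this sketch):

```python
def false_positive_rate(n_kmers, total_slots, n_filters):
    """Approximate false-positive rate for n_filters independent
    collision-ignoring hash tables sharing total_slots slots.

    Each table's occupancy is n_kmers / (total_slots / n_filters),
    and a false positive needs a collision in every table, so the
    rates multiply.
    """
    occupancy = n_kmers / (total_slots / n_filters)
    if occupancy >= 1.0:
        return 1.0  # tables saturated; every query answers 'yes'
    return occupancy ** n_filters
```

For example, 1 bn k-mers in 8 bn slots gives a 12.5% false-positive rate with one table; splitting the same memory into four tables raises each table's occupancy to 50% but drops the joint rate to 0.5^4 = 6.25%, at the cost of four hash lookups per query.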
77.-79. The k-mer oracle, revisited
We can now ask "does k-mer ACGTGGCAGG occur in the data set?" quickly and accurately.
This implicitly lets us store the graph structure, too!
80. B. Partitioning graphs into disconnected subgraphs
Which nodes do not connect to each other?
81. Partitioning graphs: it looks easy
Which nodes do not connect to each other?
82. But partitioning big graphs is expensive
Requires exhaustive exploration.
83. But partitioning big graphs is expensive
84. Tabu search: avoid global searches
85.-88. Tabu search: systematic local exploration
89. Strategies for completing big searches
90. Hard-to-traverse graphs are well-connected
91. Add neighborhood-exclusion to tabu search
92. Exclusion strategy lets you systematically explore big graphs with a local algorithm
93. Potential problems
Our oracle can mistakenly connect clusters.
94. Potential problems
This is a problem if the rate is sufficiently high!
95. However, the error is one-sided:
Graphs will never be erroneously disconnected
96. The error is one-sided:
Nodes will never be erroneously disconnected
97. The error is one-sided:
Nodes will never be erroneously disconnected.
This is critically important: it guarantees that our k-mer graph representation yields reliable "no" answers.
This, in turn, lets us reliably partition graphs into smaller graphs.
98. Actual implementation