RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017...

13
RIFRAF: a frame-resolving consensus algorithm Kemal Eren Bioinformatics and Systems Biology, University of California San Diego [email protected] Ben Murrell * Department of Medicine, University of California San Diego [email protected] 29 November 2017 Abstract Motivation: Protein coding genes can be studied using long-read next generation sequencing. However, high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus of multiple independent sequence reads retains indel errors. To solve this problem, we introduce RIFRAF,a sequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers an accurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model: the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference. The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to the per-base quality scores, giving greater weight to more accurate bases. The reference-to-template component of the model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the best consensus sequence for both objectives. Results: Using Pacific Biosciences SMRT sequences of NL4-3 env, we compare our approach to other consensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is more accurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over 80% of consensus sequences from as few as three reads, whereas the best alternative required twice as many. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly related reference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indels while removing erroneous ones. Availability: RIFRAF is implemented in Julia, and source code is publicly available at https://github.com/MurrellGroup/Rifraf.jl. Contact: [email protected] I. Introduction The problem of finding the consensus of a set of sequences is fundamental to bioinformat- ics, especially in the age of high-throughput sequencing. This paper addresses the task of re- * Corresponding author constructing an unknown true sequence from a set of error-prone reads. Many algorithms that solve this task focus on de-novo or reference- guided assembly of short reads [17, 18, 15]. However, with the advent of third-generation single-molecule sequencing technologies, such as Pacific Biosciences’ SMRT sequencing proto- 1 . CC-BY 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted December 3, 2017. ; https://doi.org/10.1101/227520 doi: bioRxiv preprint

Transcript of RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017...

Page 1: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF: a frame-resolving consensus algorithm

Kemal Eren

Bioinformatics and Systems Biology, University of California San [email protected]

Ben Murrell∗

Department of Medicine, University of California San [email protected]

29 November 2017

Abstract

Motivation: Protein coding genes can be studied using long-read next generation sequencing. However,high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus ofmultiple independent sequence reads retains indel errors. To solve this problem, we introduce RIFRAF, asequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers anaccurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model:the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference.The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to theper-base quality scores, giving greater weight to more accurate bases. The reference-to-template component ofthe model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the bestconsensus sequence for both objectives.Results: Using Pacific Biosciences SMRT sequences of NL4-3 env, we compare our approach to otherconsensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is moreaccurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over80% of consensus sequences from as few as three reads, whereas the best alternative required twice asmany. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly relatedreference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indelswhile removing erroneous ones.Availability: RIFRAF is implemented in Julia, and source code is publicly available athttps://github.com/MurrellGroup/Rifraf.jl.Contact: [email protected]

I. Introduction

The problem of finding the consensus of a setof sequences is fundamental to bioinformat-ics, especially in the age of high-throughputsequencing. This paper addresses the task of re-

∗Corresponding author

constructing an unknown true sequence from aset of error-prone reads. Many algorithms thatsolve this task focus on de-novo or reference-guided assembly of short reads [17, 18, 15].However, with the advent of third-generationsingle-molecule sequencing technologies, suchas Pacific Biosciences’ SMRT sequencing proto-

1

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 2: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

col [6], it is now possible to perform full-lengthsequencing of entire genes or small genomes.Here we will focus on finding the consensusof a set of amplicon sequences - where the se-quences have the same start and end points.An example application would be targeted se-quencing of an entire gene from a viral popula-tion (eg. [11]). We focus just on the consensusreconstruction problem, assuming that readshave first been grouped by genetic identity, ei-ther using primer ID barcodes [7, 22], or someform of clustering.

Consensus sequences found via multiple se-quence alignment may be inaccurate whenthere are few reads available, or when the readscontain many errors. SMRT sequencing in par-ticular is known to contain mostly indel errors,especially in homopolymer runs. For example,in [11], we discovered that 80% of the sequenc-ing errors were indels. If these indels carryover into the consensus sequence, they causeframeshift errors which corrupt the readingframe, and render the amino acid sequenceuninterpretable. If a reference sequence witha trusted reading frame is available, it can beexploited to inform the consensus.

Current approaches that attempt to recon-struct in-frame consensus sequences considerthese problems separately. There are ap-proaches to infer the consensus of multiplereads, and there are approaches to correct thereading frame of an already-inferred consen-sus sequence. Here, we solve these problemsjointly, simultaneously considering evidencefrom the reads and the reference sequence.

One common approach to inferring con-sensus sequences is from multiple sequencealignments (MSAs), from which the consen-sus is calculated by taking the most commonbase in each column. A myriad of multiplealignment algorithms are available [19], anyof which may be used for this task. This pa-per uses MAFFT [9, 8] as an example of thisstrategy when comparing alternatives. A mul-titude of tools, such as the cons command inEMBOSS [20], are available for computing theconsensus of these alignments. Another ap-proach is to use a partial order alignment [13]

representation of the set of sequences, andfind the consensus sequence using dynamicprogramming to extract the heaviest bundles[12]. This paper uses poaV21 for comparison.Other implementations of this approach in-clude pbdagcon2, which was released by Pa-cific Biosciences specifically for raw SMRTsequence reads, and nanopolish [14], whichwraps poaV2 for Oxford Nanopore reads. Fi-nally, specialized consensus methods are avail-able for specific sequencing technologies; thesemethods model the specific behavior of theirtarget protocol, such as read length and errormodel. In this domain, Pacific Biosciences de-veloped the Quiver [4] and Arrow algorithms3

for building circular consensus sequences fromraw ZMW reads.

Existing approaches for reading frame cor-rection (such as FrameBot, which we use hereas a comparator) exploit frame-aware codonalignment to a protein reference, followedby inserting or deleting bases in the targetsequence [24]. Related algorithms includeFALP and LAST [21], Frame-Pro [5], HMMFrame[25], and others. Another approach is hybridsequencing, which supplements long single-molecule reads with short reads [19]. Methodssuch as HGAP [4] use hybrid sequence data tofind and remove indels.

This paper introduces a new method for in-ferring consensus sequences of such reads: theReference-Informed Frame-Resolving multiple-Alignment Free consensus algorithm (RIFRAF).RIFRAF considers evidence from both the readsand the reference simultaneously, allowingreads to inform the frame correction process,and is sensitive to the read quality scores toensure that high-quality bases are more infor-mative. These features allow RIFRAF to makehighly accurate predictions, even for a smallnumber of error-prone reads. Unlike otherframe-correction methods, RIFRAF can detecttrue frameshift-causing indels and keep them

1https://sourceforge.net/projects/poamsa/2https://github.com/PacificBiosciences/

pbdagcon3https://github.com/PacificBiosciences/

GenomicConsensus

2

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 3: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

while removing spurious indels.

II. Methods

RIFRAF addresses the following sequence con-sensus problem. Let t be an unknown tem-plate sequence, which is sequenced N timesto generate a set of N pairs of reads and qual-ity scores R = {(si, pi)}N

i=1. Each read si is anoisy observation of t, and each pi is a vectorof error probabilities, one for each base in si.The ith character in read s is denoted si, andthe substring from the ith to the jth character isdenoted si...j. pi is the probability that si is anerror; an error at a base is either a substitution,an insertion, or a deletion has occurred next toit. The task is to infer a consensus sequence cthat matches the unknown t. Additionally, wealso consider a reference sequence r and preferthat c not contain insertions or deletions thatchange its reading frame relative to r. This isespecially useful when the template that gen-erated the reads in R had an intact readingframe, but the reads themselves have a highindel rate.

r

t

s1, p1 s2, p2 s3, p3 · · · sN , pN

Figure 1: Structure of the full model. The unknowntemplate t (grey) has the same reading frameas known reference r. The sequencing pro-cess generates error-prone reads s1 . . . sN withquality scores p1 . . . pN .

The structure of the full RIFRAF model isshown in Figure 1. It infers the unknowntemplate by optimizing two objectives: thequality-aware alignment to the reads, and aframe-aware alignment to the reference. Theoptimization procedure starts with an initialconsensus sequence and proceeds in an itera-tive greedy manner, mutating the consensussequence at every step to improve those objec-

tives. RIFRAF uses a number of techniques tospeed up convergence: filtering mutations, ac-cepting multiple mutations, forward and back-ward alignments, banding, batching, increasingindel penalties, and multi-stage optimization.RIFRAF is implemented in Julia [1], a high-

level scientific computing language.

i. Objective 1: pairwise alignment toreads

In order to find the optimal consensus, it is nec-essary to assign a score to candidates. RIFRAFscores consensus sequences by a global pair-wise alignment [23, 16] of each read s with thecurrent values of c. Let A be the |s|+ 1× |c|+ 1dynamic programming matrix for aligning cand s. Each ai,j is the score of aligning prefixs1...i to prefix c1...j. a0,0 is initialized to 0, andthe last cell a|s|+1,|c|+1 contains the score for thefull alignment. The score function for c and sis defined as the full alignment score: S(c|s) =a|s|+1,|c|+1. The overall score of consensus se-quence c is the sum over the alignment scoresfor all reads: S(c|R) = ∑(s,p)∈R S(c|s).

The sequencing process has an error rate ρ,which by assumption can can be partitionedinto ρ = ρmismatch + ρinsertion + ρdeletion. Theseparameters account for the different error pro-files of different sequencing technologies. Forinstance, in SMRT sequencing, indels are morelikely than substitutions. The base move scoresfor the alignment are derived from these errorprobabilities.

Typical pairwise alignment uses fixed scoresfor moves. However, RIFRAF also incorporatessequence qualities into the move scores to gen-erate more accurate alignments. The scores formatch, insertion, and deletion moves dependon the error probabilities p in the followingway. Let q = log10 p (base 10 is used insteadof the usual natural logarithm for compatibil-ity with quality scores such as Phred scores).Let qmismatch = log10 ρmismatch, and similarly forthe others. Then move scores are calculated asfollows

• A diagonal move from ai−1,j−1 to ai,j hasscore log10(1− pi) if si = cj (ie. a match),

3

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 4: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

else qmismatch + qi (ie. a mismatch).• A vertical move (insertion relative to c)

from ai−1,j to ai,j has score qinsertion + qi.• A horizontal move (deletion relative to

c) from ai,j−1 to ai,j has score qdeletion +max(qi + qi+1). If i = 0, the score is justqdeletion + q1; similarly, i = |s|, the score isjust qdeletion + q|s|.

Intuitively, the penalties for mismatches, in-sertions, and deletions are more severe whenthe consensus does not match higher qualityregions of the reads. PHRED values are cappedat 30 because rarer sources of error that are notsequencing errors (eg. PCR errors) may havevery confident PHRED scores, and we do notwish these to be overly informative. This capcan be adjusted if these sources of error can beruled out (for example if PCR was not used togenerate the amplicon library).

The best consensus c∗ under Objective 1(pairwise alignment to reads) is the one thatmaximizes S(c|R).

ii. Objective 2: Frame-aware align-ment to reference

To perform frame correction, RIFRAF requiresa reference nucleotide sequence r, which isknown to be in-frame. It models the refer-ence sequence r as having diverged from thetemplate t, where the differences between rand t represent evolutionary events, not se-quencing error as in Objective 1. The scorefor the consensus-reference alignment is mod-ified to reflect this difference. First, two newmoves are allowed during alignment: codoninsertion and codon deletion, each with theirown penalty, as shown in Figure 2. Second,a new parameter tindel is used as a multiplierfor the non-codon insertion and deletion penal-ties. Together, these two modifications bias thealignment to prefer only codon indels, keep-ing the consensus in-frame. Because it usesnucleotide alignments, this method works maybe expected to work better with more closelyrelated reference sequences, where nucleotidesimilarity is preserved.

ai,j

ai−1,j−1

ai,j−1

ai−1,j

ai,j−3

ai−3,j

Figure 2: Codon moves in the reference alignment dy-namic programming matrix. The goal is tofavor a consensus that preserves the readingframe. Thus, in addition to the usual singlematch, insertion, and deletion moves, codoninsertions and deletions are also allowed, witha lower penalty than single-base indels.

We first let RIFRAF converge to a draft tem-plate c without the reference sequence. Thisdraft template is used to approximate the di-vergence between the true template and the ref-erence, taking the edit distance normalized bythe max length d(r, c)/max(|r|, |c|) to obtain aper-base probability of template/reference dis-agreement (which is used in the same manneras the per-base quality scores p in Objective1). Reference (mis)match, indel, and codon er-ror rates are provided as parameters, and thescores for each move are computed from errorrate ρ as log10(ρ), as before.

The insertion and deletion scores are mul-tiplied by a penalty tindel , which controls theinfluence of single insertions and deletions inthe reference alignment. If tindel is small, frame-destroying indels may appear in the consensus,but if it is large, the consensus will be forcedinto the reference reading frame, even if theunobserved template really did contain indels.As we show in Section III, this penalty can betuned to discard spurious indels while keepingtrue ones.RIFRAF combines both objectives into a sin-

gle score, allowing the reads to inform theframe correction. The score of the consensus

4

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 5: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

to reference alignment is denoted Sr(c|r), andthe full score function is:

S(c|R, r) = S(c|R) + Sr(c|r).

iii. Optimization procedure

An exhaustive search for the optimal consen-sus c∗ would be intractable, so RIFRAF usesa variant of the following greedy search algo-rithm, with some optimizations to speed upconvergence:

1. Start with a guess c0. RIFRAF chooses theread with the lowest expected number oferrors.

2. For the most recent guess ci, examine aset of candidate single mutations, suchas insertions, deletions, and substitutions.Note that these candidates vary at each op-timization stage. Keep all that improve thescore S(ci|R, r). Call the set of candidatemutations C.

3. If C is empty, accept ci and terminate. Oth-erwise, choose some subset of C, applythem to ci to obtain ci+1, and iterate.

RIFRAF works in two stages, first optimiz-ing just S(c|R), and then optimizing the fullS(c|R, r).

iii.1 Filtering mutations

When comparing the template to the reads, weneed not consider all possible modificationsto the current consensus. For example, if anycandidate mutation to c does not appear inany pairwise alignment of c with a read, thatmutation need not be scored. Since it has nosupport among any observed sequence, it islikely to hurt the alignment score. Similarly,during the frame correction stage, the modelonly proposes insertions or deletions that ap-pear in the pairwise alignment to reference.

iii.2 Multiple mutations

Instead of accepting only the best mutationin C, RIFRAF accepts all the mutations that

are separated by a certain number of posi-tions: nseparate (the default value is 15, i.e. fivecodons). The candidates are accepted in orderfrom best to worst score. This policy allowsRIFRAF to converge in many fewer iterationsthan if it only accepted one mutation per iter-ation. nseparate ensures that the changes to theconsensus are relatively independent of eachother, and that the score of one is unlikely to beaffected by the acceptance of another. After ac-cepting mutations in C, RIFRAF also comparesthe new score to the score that would be ob-tained from accepting only the single best mu-tation in C, and optionally accepts that singlemutation instead if it results in a better score.

iii.3 Forward and backward alignments

Recomputing the full alignment matrix foreach candidate mutation to c would be pro-hibitively expensive. For a sequence c fromalphabet {A, C, G, T}, there are 4(|c| + 1) in-sertions, 3|c| substitutions, and |c| deletions toconsider. Computing the alignment matrix Afor each candidate requires O(cs) operations,so each iteration of the proposed algorithmwould require O(Nc2s) operations (we omit| · | in O(·) for clarity). Instead, RIFRAF usesforward and backward alignments to computethe new score for any single change to c byonly recomputing a single column of A [4].

To achieve this, in addition to the prefixalignment matrix A, where ai,j is the score foraligning prefix s1...i to prefix c1...j, RIFRAF alsocomputes the suffix alignment matrix B, wherebi,j is the score for aligning suffix si+1...|s| tosuffix cj+1...|c|. Note that a|s|,|c| = b0,0 is thescore for the full alignment. For any j, thatalignment score can also be computed fromcolumns A·,j and B·,j:

∀j ∈ [0 . . . v] : a|s|,|c| = b0,0 = maxi(ai,j + bi,j)(1)

Modifying cj leaves unchanged columns0 . . . j − 1 of A, and also leaves unchangedcolumns j . . . v of B. Therefore, for all threetypes of mutations, computing the new score

5

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 6: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

requires that at most only a single new columnof A must be recalculated.

1. substitution at cj: compute A·,j; new scoreis maxi(ai,j + bi,j).

2. insertion after cj: compute A·,j+1; newscore is maxi(ai,j+1 + bi,j).

3. deletion of cj: no new column necessary;new score is maxi(ai,j−1 + bi,j).

Using the forward and backward alignments,all possible mutations to the consensus can bescored in O(Ncs) operations.

During the alignment of the template andreference, additional columns must be recom-puted to account for codon insertion and dele-tion moves.

iii.4 Banding

b

b||s| − |c||

Figure 3: Banded alignment. Alignments must staywithin the banded region of the dynamic pro-gramming matrix.

Despite the improvements from using for-ward and backward alignments, each iterationis still approximately quadratic in the length ofthe consensus, assuming |c| ≈ |s|. Alignmentbanding [2, 3] further reduces the number ofoperations per iteration. For a given band-width parameter b, the maximum usable col-umn size in A and B is 2b + ||s| − |c|| � |s|, soevaluating a possible mutation requires manyfewer operations than recomputing the full col-umn. Alignment moves are only allowed tooriginate inside the band, so alignment pathsmust stay within the band boundaries (see Fig-ure 3). With banding, the time complexityper iteration becomes O(Nc(

√s + b)), since

||s| − |c|| grows like√|s| under reasonable as-

sumptions.RIFRAF dynamically increases the band-

width if the number of differences in thebanded alignment is sufficiently larger thanthe expected number of differences implied bythe read’s quality scores, under the assumptionthat the difference between the template candi-date c and the true template is much smallerthan the number of sequencing errors in s. Letr be the observed number of differences be-tween s and c, and e be the expected numberof errors computed from the quality scores p.If the value of r is in the upper tail of a Poissondistribution with mean parameter e, then thebandwidth is doubled and the alignments arere-computed. α controls the size of this uppertail probability, with a default value of 0.1.

iii.5 Batching

RIFRAF uses a variety of batching strategiesto speed up convergence. If the number ofreads is greater than a threshold k (default 5),the best k reads by error rate are fixed as theinitial batch, and RIFRAF runs to convergence.This ensures that RIFRAF first converges with-out considering the many spurious mutationspresumably present in less accurate reads. Theresulting initial guess is further refined at therefinement stage, this time with a different ran-dom batch of size nbatch (default 20) for eachiteration. Sequences are chosen for inclusion inthe batch by sampling from a multinomial dis-tribution of their error rates, parameterized byparameter ρ between 0 and 1. When ρ = 1, allthe weight is evenly distributed among the topnbatch sequences. Interpolating from ρ = 1.0to ρ = 0.5, the probabilities become propor-tional to the read error rates. Interpolatingfrom ρ = 0.5 to ρ = 0, the probabilities becomeuniform. By default, ρ = 0.9i, where i is thenumber of iterations since random batchingactivated. Like the fixed batch, this strategyspeeds up convergence by initially avoidinginaccurate reads, then gradually letting themcontribute to resolve uncertain bases if neces-sary.

6

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 7: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

Ideally, nbatch is small enough to make eachiteration fast, but large enough that RIFRAFconverges stably. RIFRAF tries to detect if nbatchis too small by monitoring the change in scoreafter each iteration. If the new score is worsethan the old score by more than a certain per-cent (10% by default), nbatch is increased to2nbatch, then 3nbatch, etc.

Combining all of the previous optimizations,a single iteration’s time complexity is reducedfrom O(Nc2s) to O(nbatchc(

√s + b).

iv. Increasing indel penalties

Whenever the algorithm converges to a con-sensus ci, if single indel moves were used incomputing Sr(·), the single insertion and dele-tion scores are multiplied by a parameter tindel ,increasingly encouraging the alignment withthe reference to use only codon indels, therebykeeping c in-frame. This process repeats up tom times, so the maximum multiplier is (tindel)

m.If the penalty is large enough, the consensuswill always be forced into the reference’s read-ing frame, which is the default behavior. How-ever, some consensus sequences really are outof frame relative to the reference. The indelpenalties can be tuned so that RIFRAF correctlyidentifies true frameshifts, with a small risk ofallowing some spurious ones.

v. Multi-stage optimization

The full optimization procedure proceeds instages, allowing RIFRAF to converge quicklyby focusing on different objectives in differentstages.

1. Initial stage: Do not use the reference. Pro-pose all mutations to the consensus thatappear in the pairwise alignments. Usethe fixed batch, if available. If no referencewas provided, stop after this step.

2. Frame correction stage: Use the full model,including the reference and reads. Pro-pose indel candidate mutations that ap-pear in the consensus-reference alignment.

Increasingly penalize single indels in align-ment of r and c. Use the fixed batch, ifavailable.

3. Refinement stage: Propose only substitu-tions (no indels) to the consensus that ap-pear in the pairwise alignments, no longerconsidering the reference. Use randombatches, with decreasing ρ.

The initial stage quickly finds a good can-didate consensus from the reads alone. Theframe correction stage uses a reference to pe-nalize indels that cause frame shift errors, cor-recting the reading frame of the template in away that is maximally compatible with both thereads and the reference. Finally, the refinementstage ensures that the reference influences onlythe frame of c, and exerts no bias upon the nu-cleotides themselves. The final stage also fixesbiases introduced by the fixed batch in the firsttwo stages.

III. Results

We compared RIFRAF to two other meth-ods: MAFFT [10] followed by the standard per-column consensus, and POA [13] with the heav-iest bundle consensus algorithm [12]. All threemethods were run with and without reference-guided frame correction. RIFRAF natively per-forms frame correction, but only if it is given areference sequence. To distinguish these mod-els in this section, we refer to the model withno reference as RIFRAFnr, and the model witha reference as RIFRAFref. FrameBot [24] wasused for correcting results from MAFFT and POA;these are referred to as MAFFT_FB and POA_FB.

A full-length sequencing run of Pacific Bio-sciences SMRT sequencing on env from HIV-1subtype B strain NL4-3 was used for the com-parison [11]. The true sequence of NL4-3 isknown, so results could be compared to theground truth. The filtered data (available onFigShare4) contains 27,600 reads with expectederror rate 1% or better, which were further fil-tered and processed as follows. To make the

4https://doi.org/10.6084/m9.figshare.5643247

7

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 8: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

NL4-3 HXB2 B.DK B.ES B.CN B.BRreference

0

1

2

3

4

5

6

7

8

dna

edit

dist

ance

RIFRAF_refMAFFT_FBPOA_FB

(a) DNA sequence edit distance for increasingly distantreferences.

NL4-3 HXB2 B.DK B.ES B.CN B.BRreference

0

1

2

3

4

5

prot

ein

edit

dist

ance

RIFRAF_refMAFFT_FBPOA_FB

(b) Protein sequence edit distance for increasingly distantreferences.

Figure 4: Edit distance for increasingly distant references. All methods do better with closely-related reference, buttheir rate of performance degradation is important because a related reference may not always be available.Run with N = 3 full-length reads.

problem more challenging and better reveal dif-ferences between methods, very high qualitysequences were excluded (expected error rate< 0.1%). Short fragments and long reads (oftenconcatemers) were discarded by filtering outsequences 25 bases shorter or longer than themedian of 2,597. PacBio reads come in randomorientations, so reads were converted to theirreverse complement, if necessary. Extra basesaround the amplicon were removed by align-ing to NL4-3 env without penalizing terminalgaps, then trimming terminal insertions. Afterpreprocessing, 9,473 sequences remained, witha mean error rate of 0.0015 (the distribution oferrors appears in Figure S1). All experimentswere run for 1,000 trials on randomly sampledreads.

Choice of reference. A set of reference se-quences – shown in Figure S2 – were tested toinvestigate how frame correction accuracy dete-riorates for distantly-related references. The re-sults are shown in Figure 4. Nucleotide resultsfrom MAFFT_FB and POA_FB were both equallyinsensitive to the choice of reference, whereasRIFRAFref’s results did degrade slightly. How-ever, the reverse is true for the protein se-quences, with RIFRAFref’s performance de-grading by half an amino acid on average, andthe others degrading by more than one. Thisdifference indicates that RIFRAFref not onlykeeps the consensus in-frame, but also makesbetter choices of inferring which nucleotidesare truly indel errors. Finally, RIFRAFref was

the most accurate, regardless of choice of ref-erence. As expected, RIFRAFref’s frame correc-tion strategy works best with a closely relatedreference, but these results show that it is ca-pable of working even with a distant reference.Except where noted, the most distant reference,B.BR, was used for the rest of the results.

Number of sequences. Clusters of 2, 3, 4,5, 6, 8, 10, 15, and 20 reads were randomlysampled for this experiment. The fraction ofperfectly reconstructed consensus sequencesper 1,000 trials appears in Figure 5a. For fewerthan ten sequences, both versions of RIFRAFdominate the other corresponding methods.For instance, RIFRAFref gets over 90% correctwith access to only four reads. POA_FB doesnot achieve similar results until N = 8, andMAFFT_FB does not until between N = 10 and15. Interestingly, POA’s results actually degradesignificantly for n > 6, but POA_FB continuesto improve, because POA tends to include extrabases on the ends of the consensus sequencewhich are then removed by FrameBot. Theseextra bases also affect the average number ofnucleotide errors (Figure 5b): for N = 20, POAaverages one error per sequence, whereas allthe other methods average none.

The average number of protein errors (Fig-ure 6a) highlights the importance of frame cor-rection. Frame shifts cause the translated con-sensus sequences to differ greatly from the trueprotein sequence, especially for n < 15. ForN = 2, fully half of each protein sequence is

8

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 9: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

2 3 4 5 6 8 10 15 20number of sequences

0.0

0.2

0.4

0.6

0.8

1.0

fract

ion

of se

quen

ces c

orre

ct

RIFRAF_refRIFRAF_nrPOA_FBPOAMAFFT_FBMAFFT

(a) Fraction of correct sequences versus number of se-quences.

2 3 4 5 6 8 10 15 20number of sequences

0

5

10

15

20

25

dna

edit

dist

ance

RIFRAF_refRIFRAF_nrPOA_FBPOAMAFFT_FBMAFFT

(b) DNA edit distance versus number of sequences.

Figure 5: DNA results. Fraction of correct sequences (left) and mean edit distance between the consensus and thetemplate (right) for increasing N.

2 3 4 5 6 8 10 15 20number of sequences

0

100

200

300

400

500

prot

ein

edit

dist

ance

RIFRAF_refRIFRAF_nrPOA_FBPOAMAFFT_FBMAFFT

(a) Protein edit distance versus number of sequences; allmethods.

2 3 4 5 6 8 10 15 20number of sequences

0.0

2.5

5.0

7.5

10.0

12.5

15.0

17.5

prot

ein

edit

dist

ance

RIFRAF_refMAFFT_FBPOA_FB

(b) Protein edit distance versus number of sequences; framecorrection only

Figure 6: Same results as Fig. 5, but for the translated protein sequences. The fraction of correct sequences is notreproduced, since those figures are identical. The left figure show results for all methods. The right figureshow the same data, zoomed to show the frame-corrected results. Note Y axis scale.

wrong on average, regardless of method. Evenfor N = 20, sequences from RIFRAFnr and POAcontain about 100 errors. On the other hand,the corrected sequences (shown in Figure 6bfor clarity) contain nearly no errors for n > 10.RIFRAFref again performs best here, approach-ing zero errors even for N = 3.

Interestingly, frame correction of MAFFT andPOA often made the nucleotide sequences lessaccurate, whereas it improves RIFRAFref. Thisresult supports the idea that RIFRAF’s methodof integrating frame correction into the consen-sus algorithm makes it more accurate by allow-ing all reads to inform the correction process.FrameBot, which only has access to a singleconsensus sequence, cannot use the extra in-formation in the reads, and therefore cannotachieve the same accuracy.

Execution times appear in Figure 7. With-out frame correction, all three methods arecomparable for small numbers of sequences,but RIFRAFnr scales better, due to its batch-ing scheme. Frame correction adds a constantfactor to all three methods’ execution times.RIFRAFref’s constant factor is larger, but, be-cause it scales better, it overtakes the othersbetween N = 10 and N = 15.

Sequence length. Figure 7 also shows exe-cution time for varying sequence lengths. Formore details on this experimental setup, seeSI section 2. RIFRAFref scales less well thanthe other methods, taking about twice as longas MAFFT_FB and POA_FB for the full-length se-quences. However, it is comparable with theothers at ` = 900, and faster than the othersfor ` < 600. This difference in speed is due

9

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 10: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

2 3 4 5 6 8 10 15 20number of sequences

0

10

20

30

40

runn

ing

time

(sec

onds

)RIFRAF_refRIFRAF_nrPOA_FBPOAMAFFT_FBMAFFT

(a) Mean execution time versus number of sequences.

300 600 900 1200 1800 2100 2565template length

0

2

4

6

8

10

12

14

runn

ing

time

(sec

onds

)

RIFRAF_refRIFRAF_nrPOA_FBPOAMAFFT_FBMAFFT

(b) Mean execution time versus template length.

Figure 7: Mean execution time, varying both number of sequences (left) and sequence length (right). Note that intervalson the x-axis are not linear.

to RIFRAF’s iterative approach, which requiresrecomputing parts of each pairwise alignmentafter every iteration.

Detecting true frameshifts. In the otherexperiments, strict frameshift penalties wereused to ensure the consensus stays in-frame.However, sometimes frameshifts are biologi-cally plausible, such as in integrated (but non-functional) proviral Env sequences, or in thecytoplasmic tail of Env leading to a truncation,but preserving infectivity. If true frameshiftsmay occur in the template sequence, it maybe preferable to relax this frameshift penalty.RIFRAFref can be tuned to accept frameshiftindels with enough support in the reads, withonly a small increase in the frequency ofspurious frameshift indels. To demonstratethis, single base insertions and deletions wereadded to NL4-3 in both homopolymer and non-homopolymer regions (details in SI section 3).tindel was set to 1.05, and the max frameshiftindel penalty multiplier m varied from 0 to 12.We call an in-frame sequence a “positive”, soincreasing m increases the false positive rate byforcing sequences with real frameshifts incor-rectly into frame. To get the true positive rate,RIFRAFref was also run on the unmodified se-quences. Note that while we introduce only asingle true indel into our “negative” cases, theanalysis is always at the whole-sequence level.We are not just detecting the presence or ab-sence of the specific indel we introduce. Thusto achieve a high true positive and low falsepositive rate, RIFRAFref must successfully ig-nore spurious indels at any position in the “pos-

itive” cases, while successfully identifying thereal indel we introduce in each “negative” case.The resulting ROC curves, which appear in Fig-ure 8 for N ∈ 3, 5, 10, show that RIFRAFref canfind true indels while controlling the false posi-tive rate, using either a closely related referenceor a distant one. A useful trade-off occurs form = 6, which scores close to the maximum truepositive rate while keeping the false positiverate close to zero.

In agreement with the accuracy results, amore closely related reference (HXB2) im-proved inference for N = 3 for this task. As ex-pected, real homopolymer indels in homopoly-mer regions are harder to discriminate thannon-homopolymer indels (See SI section 3 formore detailed results).

IV. Conclusion

RIFRAF uses quality scores and a reference se-quence to infer accurate frame-corrected con-sensus sequences. It can often find the cor-rect consensus, even from small numbers ofreads or with a distant reference, as shown inour experimental results. RIFRAF with framecorrection can be slower than taking a con-sensus from a multiple sequence alignment,but in experiments with real SMRT sequencesit finds consensus sequences that are signifi-cantly more accurate. The benefits of usinga reference to reduce frameshift errors areespecially apparent when comparing trans-lated amino acid sequences, where a single

10

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 11: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

0.0 0.2 0.4 0.6 0.8 1.0False positive rate

0.0

0.2

0.4

0.6

0.8

1.0Tr

ue p

ositi

ve ra

te

(a) N = 3

0.0 0.2 0.4 0.6 0.8 1.0False positive rate

0.0

0.2

0.4

0.6

0.8

1.0

True

pos

itive

rate

(b) N = 5

0.0 0.2 0.4 0.6 0.8 1.0False positive rate

0.0

0.2

0.4

0.6

0.8

1.0

True

pos

itive

rate

(c) N = 10

Figure 8: ROC curves for true indel experiments, with max indel penalty multiplier m varying from 0 to 12. Theorange point denotes results from RIFRAFnr, while the remainder of the curve was generated by RIFRAFref.The green point corresponds to a max indel penalty multiplier m = 6. Both a related reference (HXB2, blue)and distant reference (B.BR, red) were used.

frameshift causes the entire downstream se-quence to be incorrect. Finally, RIFRAF candetect and retain true frameshifts during framecorrection, and, to our knowledge, is the onlymethod capable of this.

While RIFRAF performs well with distantlyrelated reference sequences, performance is im-proved when using closely related references.However, when sequencing diverse popula-tions, we note that it is always possible to firstinfer a set of autologous sequences from clus-ters or primer ID bundles that have a largenumber of reads, and so should be accurate.These can then be used as references to cor-rect the reading frame of the less-representedmembers of the population, providing an im-proved accuracy over just using a more dis-tantly related reference. We recommend usingthis strategy whenever possible.

RIFRAF can improve the ability to resolve mi-nority variants in sequenced populations. Itsability to find results comparable to MAFFT withthree times fewer reads will be essential foridentifying minority variants in the populationwith greater precision. More generally, RIFRAFwill be useful whenever an accurate consensussequence must be inferred from a small num-ber of full-length sequences, especially when

quality scores and a reference sequence areavailable.

When sequencing any population, it is oftenadvisable to sequence a clonal representative ofthat population first (NL4-3 env here), to inves-tigate the sequencing performance for that case.We recommend using such sequence datasetsto investigate the behavior of RIFRAF on newgenes, especially if the user seeks to detect realframeshifts. To this end, we provide a Jupyternotebook that allows one to replicate the accu-racy and ROC analyses from this manuscripton any clonal amplicon dataset.

RIFRAF will continue to be developed alongmultiple lines. First, the current approach forperforming frame correction needs to be faster,to keep pace with the increasing volume ofavailable sequence data. Further work needs tobe done to speed it up via optimization or algo-rithmic advances. Possible approaches include:re-using partial alignments, speeding up align-ments with k-mer seeding, and only correctingthe frame of obviously problematic regions.Another improvement would include aminoacid matching penalties in the reference-to-template alignment, which would allow evenmore distantly related reference sequences tobe used, where the nucleotide homology has

11

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 12: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

been completely obliterated. Another usefulfeature would be to infer calibrated qualityscores for the consensus sequence, in orderto communicate uncertain regions to the user.Finally, RIFRAF is extensible to other systemsand sequencing technologies. In particular, weplan to investigate its behavior and tune itserror model for Oxford Nanopore data, and toextend the method to support amplicons con-taining both non-coding and coding regions,which may contain different (potentially over-lapping) reading frames.

The RIFRAF source code is available at https://github.com/MurrellGroup/Rifraf.jl.

Acknowledgements

We are grateful to Sasha Murrell for copy edit-ing.

Funding

Research reported here was supported bythe National Institute Of Allergy And Infec-tious Diseases of the National Institutes ofHealth under Award Numbers R00AI120851and R01AI120009, and in part by the Univer-sity of California, San Diego Center for AIDSResearch (P30AI036214). K.E. was supportedin part by R21AI115701. The content is solelythe responsibility of the authors and does notnecessarily represent the official views of theNational Institutes of Health.

References

[1] Jeff Bezanson, Alan Edelman, StefanKarpinski, and Viral B Shah. Julia: A freshapproach to numerical computing. SIAMReview, 59(1):65–98, 2017.

[2] Kun Mao Chao, Ross C. Hardison, andWebb Miller. Constrained sequence align-ment. Bulletin of Mathematical Biology,55(3):503–524, 1993.

[3] Kun Mao Chao, William R. Pearson, andWebb Miller. Aligning two sequences

within a specified diagonal band. Bioinfor-matics, 8(5):481–487, 1992.

[4] Chen-shan Chin, David H Alexander,Patrick Marks, Aaron A Klammer, JamesDrake, Cheryl Heiner, Alicia Clum, AlexCopeland, John Huddleston, Evan E Eich-ler, Stephen W Turner, and Jonas Korlach.Nonhybrid , finished microbial genomeassemblies from long-read SMRT sequenc-ing data. 10(6), 2013.

[5] Nan Du and Yanni Sun. Improve homol-ogy search sensitivity of PacBio data bycorrecting frameshifts. In Bioinformatics,volume 32, pp. i529–i537, 2016.

[6] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle,G. Otto, P. Peluso, D. Rank, P. Bay-bayan, B. Bettman, A. Bibillo, K. Bjorn-son, B. Chaudhuri, F. Christians, R. Ci-cero, S. Clark, R. Dalal, A. DeWinter,J. Dixon, M. Foquet, A. Gaertner, P. Hard-enbol, C. Heiner, K. Hester, D. Holden,G. Kearns, X. Kong, R. Kuse, Y. Lacroix,S. Lin, P. Lundquist, C. Ma, P. Marks,M. Maxham, D. Murphy, I. Park, T. Pham,M. Phillips, J. Roy, R. Sebra, G. Shen,J. Sorenson, A. Tomaney, K. Travers,M. Trulson, J. Vieceli, J. Wegener, D. Wu,A. Yang, D. Zaccarin, P. Zhao, F. Zhong,J. Korlach, and S. Turner. Real-TimeDNA Sequencing from Single PolymeraseMolecules. Science, 323(5910):133–138,2009.

[7] Cassandra B Jabara, Corbin D Jones, Jef-frey Roach, Jeffrey A Anderson, andRonald Swanstrom. Accurate samplingand deep sequencing of the hiv-1 proteasegene using a primer id. Proceedings of theNational Academy of Sciences, 108(50):20166–20171, 2011.

[8] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. MAFFT: anovel method for rapid multiple sequencealignment based on fast Fourier trans-form. Nucleic acids research, 30(14):3059–3066, 2002.

12

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint

Page 13: RIFRAF: a frame-resolving consensus algorithmDec 03, 2017  · RIFRAF Eren and Murrell November 2017 bioRxiv preprint col [6], it is now possible to perform full-length sequencing

RIFRAF • Eren and Murrell • November 2017 • bioRxiv preprint

[9] Kazutaka Katoh and Daron M. Standley.MAFFT multiple sequence alignment soft-ware version 7: Improvements in perfor-mance and usability. Molecular Biology andEvolution, 30(4):772–780, 2013.

[10] Kazutaka Katoh and Daron M Standley.Mafft multiple sequence alignment soft-ware version 7: improvements in perfor-mance and usability. Molecular biology andevolution, 30(4):772–780, 2013.

[11] Melissa Laird Smith, Ben Murrell, Ke-mal Eren, Caroline Ignacio, Elise Landais,Steven Weaver, Pham Phung, ColleenLudka, Lance Hepler, Gemma Caballero,Tristan Pollner, Yan Guo, Douglas Rich-man, Pascal Poignard, Ellen E. Paxinos,Sergei L. Kosakovsky Pond, and Davey M.Smith. Rapid Sequencing of Complete envGenes from Primary HIV-1 Samples. VirusEvolution, 2(2), 2016.

[12] Christopher Lee. Generating consensussequences from partial order multiple se-quence alignment graphs. Bioinformatics,19(8):999–1008, 2003.

[13] Christopher Lee, Catherine Grasso, andMark F. Sharlow. Multiple sequence align-ment using partial order graphs. Bioinfor-matics, 18(3):452–464, 2002.

[14] Nicholas J Loman, Joshua Quick, andJared T Simpson. A complete bacterialgenome assembled de novo using onlynanopore sequencing data. Nature Meth-ods, 12(8):733–735, 2015.

[15] Niranjan Nagarajan and Mihai Pop. Se-quence assembly demystified. Nature Re-views Genetics, 14(3):157–167, 2013.

[16] Saul B. Needleman and Christian D. Wun-sch. A general method applicable to thesearch for similiarities in the amino acidsequence of two proteins. Journal of molec-ular biology, 48(3):443–453, 1970.

[17] Sankar K. Pal, Sanghamitra Bandyopad-hyay, and Shubhra Sankar Ray. Evolu-tionary computation in bioinformatics: A

review. IEEE Transactions on Systems, Manand Cybernetics Part C: Applications and Re-views, 36(5):601–615, 2006.

[18] Konrad Paszkiewicz and David J.Studholme. De novo assembly of shortsequence reads. Briefings in Bioinformatics,11(5):457–472, 2010.

[19] Pervez, M Babar, Asif Nadeem, M Aslam,A Awan, Naeem Aslam, Tanveer Hussain,Nasir Naveed, Salman Qadri, Usman Wa-heed, and Muhammad Shoaib. Evaluatingthe Accuracy and Efficiency of MultipleSequence Alignment Methods. Evolution-ary Bioinformatics, p. 205, dec 2014.

[20] Peter Rice, Ian Longden, and Alan Bleasby.Emboss: the european molecular biologyopen software suite, 2000.

[21] Sergey L. Sheetlin, Yonil Park, Martin C.Frith, and John L. Spouge. Frameshiftalignment: Statistics and post-genomicapplications. Bioinformatics, 30(24):3575–3582, 2014.

[22] Daniel J Sheward, Ben Murrell, and Car-olyn Williamson. Degenerate primerids and the birthday problem. Proceed-ings of the National Academy of Sciences,109(21):E1330–E1330, 2012.

[23] Temple F. Smith and Michael S. Waterman.Comparison of biosequences. Advances inApplied Mathematics, 2(4):482–489, 1981.

[24] Qiong Wang, J. F. Quensen, Jordan A Fish,T. Kwon Lee, Yanni Sun, James M. Tiedje,and James R. Cole. Ecological Patterns ofnifH Genes in Four Terrestrial ClimaticZones Explored with Targeted Metage-nomics Using FrameBot, a New Informat-ics Tool. mBio, 4(5):e00592–13–e00592–13,sep 2013.

[25] Y Zhang and Y Sun. HMM-FRAME:accurate protein domain classificationfor metagenomic sequences containingframeshift errors. Bmc Bioinformatics,12:198, 2011.

13

.CC-BY 4.0 International licenseunder anot certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted December 3, 2017. ; https://doi.org/10.1101/227520doi: bioRxiv preprint