Optimal k -mer superstrings for protein identification and DNA assay design.

Optimal k-mer superstrings

for protein identification

and DNA assay design.

Optimal k-mer superstrings

for protein identification

and DNA assay design.Nathan EdwardsCenter for Bioinformatics and Computational BiologyUniversity of Maryland, College Park

2

k-mer (Sub-)Problems

• Enumerate:• For all (distinct) k-mers, do

• Existence:• ...with respect to exact (& inexact) count ¸ x

• Uniqueness:• ...with respect to exact & inexact match

• Near-neighbors:• ...with respect to inexact match

• Representation:• Represent (distinct) k-mers for other tools• Fast annotation of k-mer counts on original sequences

3

Applications of k-mer sets

• Peptide Identification• Represent all amino-acid 30-mers• ...that occur at least twice in human dbEST

• PCR Primer Design:• Test DNA 20-mer primers for uniqueness• What does it mean to be unique?

• DNA sequencing error / repeat detection• Eliminate mers that are too rare or too frequent

• Pathogen signatures• Near-neighbors imply potential false-positives

4

k-mer Superstring Problem

Given• A set of sequences S = { S1, ..., Sn }

• Sequence database• Word size kFind• A new set of sequences T = { T1, ..., Tm }Such that• Total length of T is minimized, and• T is complete and correct w.r.t. k-mers of S

5

k-mer Superstring Problem

• Completeness• All of the k-mers of S are represented

• Correctness• No additional k-mers are present

• Minimize the total representation length• Correlates with running time

6

Shortest (common) superstring problem

• General strings (arbitrary length)• Single output string• Completeness for input sequences only

• Classical NP-hard problem• Garey and Johnson

• Approximate within ~ 2.5*OPT• Max-SNP hard

• One of the first algorithmic approaches to genome assembly

7

de Bruijn Sequences

de Bruijn sequences represent all words of length k from some alphabet A.

A = {0,1}, k = 3: s = 0001110100

A = {0,1}, k = 4: s = 0000111101011001000

8

de Bruijn Graph: A = {0,1}, k = 4

110

011

100

001

000 010 111101

1 1

11

1

11

1

00

0

00

0

00

9

de Bruijn Sequences & Graphs

de Bruijn graphs (k,A):• Edges represent length k words from A• Each node has

• in degree |A|• out degree |A|

Eulerian tour constructs de Bruijn sequence.

10

Sequencing-by-Hybridization-graph

ACDEFGI, ACDEFACG, DEFGEFGI

11

Compressed SBH-graph


12

Sequence Databases & CSBH-graphs

• Original sequences correspond to paths


13

C3 Enumeration

• Complete• All k-mers are present

• Correct• No other k-mers are present

• Compact• No k-mer is present more than once

14

Correct, Complete, Compact (C3) Enumeration

• Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG

15

Correct, Complete (C2) Enumeration

• Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG

16

Patching the CSBH-graph

• Use artificial edges to fix unbalanced nodes

17

Patching the CSBH-graph

• Use matching-style formulations to choose artificial edges• Optimal C2/C3 enumeration in polynomial time.

• Chinese Postman Problem• Edmonds and Johnson, ’73

• l-tuple DNA sequencing• Pevzner, ’89

• Shortest (Common) Superstring• MAX-SNP-hard, 2.5 approx algorithm

18

Related work

• Chinese Postman Problem• Undirected graph, weighted edges• Shortest path that uses all the edges

• Solvable in polynomial time• Construct minimum weighted matching

between nodes of odd-degree• Add matching to graph and find Eulerian

path• Minimize weight of extra edges used

19

C2 Enumeration

• Chinese postman problem, except:• Directed graph

• Add edges from nodes with surplus in-degree to nodes with surplus out-degree

• Fixed cost teleportation option• Can always “start” a new sequence

• Find optimal set of additional edges• Transportation problem / min cost flow

instance

20

C3 Enumeration

1

3

2

1

3

-2

-1

-4

-1

-2Cost: k

#in-#out #in-#out

21

Reusing Edges

ACD HAC

EHAC

FHAC

GHAC

D

• ACDEHAC, ACDFHAC, ACDGHACD

22

• C3: ACDEHACDFHAC, ACDGHACD

Reusing Edges

ACD HAC

EHAC

FHAC

GHAC

D

$ACD

23

• C2: ACDEHACDFHACDGHAC

Reusing Edges

ACD HAC

EHAC

FHAC

GHAC

D

D

24

C2 Enumeration

1

3

2

1

3

-2

-1

-4

-1

-2

4

7

10

“Shortcut paths”

#in-#out #in-#out

25

C3 Enumeration

1

3

2

1

3

-2

-1

-4

-1

-2

#in-#out #in-#out

Cost: k

0 0

Cost: 0 Cost: 0

26

Sample Preparation for Peptide Identification

Enzymatic Digestand

Fractionation

27

Single Stage MS

MS

m/z

28

Tandem Mass Spectrometry(MS/MS)

Precursor selection

m/z

m/z

29

Tandem Mass Spectrometry(MS/MS)

Precursor selection + collision induced dissociation

(CID)

MS/MS

m/z

m/z

30

Peptide Identification

• For each (likely) peptide sequence1. Compute fragment masses2. Compare with spectrum3. Retain those that match well

• Peptide sequences from protein sequence databases• Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures

31

Novel Splice Isoform

• Human Jurkat leukemia cell-line• Lipid-raft extraction protocol, targeting T cells• von Haller, et al. MCP 2003.

• LIME1 gene:• LCK interacting transmembrane adaptor 1

• LCK gene:• Leukocyte-specific protein tyrosine kinase• Proto-oncogene• Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

32


http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

33


http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr20:61839670-61839828&hgsid=68112320&est=pack

34

Novel Mutation

• HUPO Plasma Proteome Project• Pooled samples from 10 male & 10 female

healthy Chinese subjects• Plasma/EDTA sample protocol• Li, et al. Proteomics 2005. (Lab 29)

• TTR gene• Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis.• Familial amyloidotic polyneuropathy

• late-onset, dominant inheritance

35

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300002887.18.xml&uid=202568&label=AAAKEPZA&homolog=AAAKEPZA&id=1838.1.1&proex=-1

36

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr18:27426944-27426971&hgsid=68063647

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr18:27426944-27426971&hgsid=68063647

37

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Inclusive gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results

38

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results

39

Sequence Databases & CSBH-graphs

• All k-mers represented by an edge have the same count

2 2

1

2

1

40

CSBH-graph subgraphs

• Quickly determine those that occur twice

2 2

1

2

41

k-mer (Sub-)Problems

• Enumerate:• For all (distinct) k-mers, do

• Existence:• ...with exact (& inexact) count ¸ x

• Uniqueness:• ...exact & inexact match

• Near-neighbors:• ...inexact match

• Representation:• Represent (distinct) k-mers for other tools• Fast annotation of k-mer counts on original sequences

42

Large scale instances!

• CSBH-graph instances• Partition set of all k-mers, determine non-trivial nodes• Days on condor grid (250 CPUs) to construct• ¸ 100,000,000 nodes and edges (sparse & dense)

• Min-cost flow instances• ¸ 500,000 nodes and edges

• Algorithms must be linear in problem size• Out-of-core Eulerian path algorithm?

• Currently testing out-of-core connected-components

43

Grid computing

• Heterogeneous machines• Varying disk/memory/MHz/cores capabilities

• Centralized scheduler• Jobs started asynchronously• Other jobs may preempt current job

• Input files may need to be staged• 250 simultaneous requests for a 3Gb file?• How to guarantee integrity of input files?

• Problem decomposition may be non-trivial• Jobs sizes need to fit the least capable machine• Sometimes need to “game” the scheduler

• Need to ensure the integrity of job output

44

Uniqueness Oracles

• Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb)• Count occurrences in the genome: 0,1,2+• Construct 20-mer superstring for 20-mers with

count 1• Construct 20-mer superstring for 20-mers with

count > 1• Easy(-ish) for exact sequence match: O(n)

• Fast automata, hash tables, suffix trees.

45

Polymerase Chain Reaction

46

Polymerase Chain Reaction

47

Inexact sequence match

• Inexact sequence matching O(n*m*k)• Errors/Mismatches (k): 1,2,3• # distinct 20-mers (m): O(n)

• Achieve expected linear time using a hybrid approach (blastn):• Exact search for short chunks of primers• Expensive alignment only where chunks match• Large chunks ) Fast, but miss occurrences• Small chunks ) Slow, find all matches

48

Baeza-Yates Perleberg: • Correct and O(n) for small k

• At least 1 chunk is observed with no error.• Small k → Large chunks → Fast and correct• Form of locality sensitive hashing

Inexact sequence match

≠ = ≠q

g

49

Locality Sensitive Hashing

• For each primer:• store a (set of) hash(es) in hash-table

• At each position in the genome:• look-up a (set of) hash(es) in hash-table• if any hash is found, do more expensive check

• Need to weigh• sensitivity (false negatives) vs• specificity (false positives)

• Our application requires speed and no false negatives!

50

Random Projection

• Choose T templates of l random “care” positions

q

g

51

Random Projection

• Choose T templates of l random “care” positions

t1

g

t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0

52

Random Projection

• Choose T templates of l random positions

t1

t2

g

t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1

53

Random Projection


t1

t2

g

t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1

54

Random Projection


t1

t2

g

t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1

55

Gapped seed-set design problem

• Given:• mer-size: m ( = 20 )• # errors: k ( = 1,2,3)• # cares: l ( = 10,12,14 )

• Find the smallest set of templates with no false negatives.• Minimize running time.

56

Gapped seed set design formulation (for k = 2)

• Cover the edges of Km with copies of Km-l

• How many triangles to cover K6?(m = 6, k = 2, l = 3)

• Some instances of (m,2,m-3) cover each edge exactly once:• Steiner triple systems

57

How many triangles cover K6?

• 15 edges total

• Is 5 triangles possible?

58


• 15 edges total

• Is 5 triangles possible? NO!

59


• 15 edges total


• Each node requires 3 triangles

• Triangles must account for at least 18 “edges”

60


• 15 edges total


• Each node requires 3 triangles

• Triangles must account for at least 18 “edges”

61

Gapped seed set design formulation #2

• Set cover instance:• Ground set: all possible placements of the

k errors (alignments)• Covering sets: all possible placements of

the l care positions

• For (m=20,k=2,l=10),• 190 elements, 184,756 sets!• Greedy approximation algorithm works

62


Templates Positions (m)

l Remove any kposition nodes,

at least 1 templatemust have degree l.

63


• Polynomial size in terms of number of templates

• Select T in advance and test whether sufficient.

• Greedily add 1,2,3,... templates.• Apply iteratively to achieve feasible

solution

64

Solution for (20,2,10)

.................... Positions

********** t1

********** t2

***** ***** t3

***** ***** t4

***** ***** t5

********** t6• Need > 4 templates, 6 is optimal

65

Remember the application!

• We are checking some templates twice!• We compute hash(es) at each position in

the genome

• Any template that is a shift of another will be computed at some nearby genomic position!

66

Solution for (20,2,10)

.................... Positions

********** t1

********** t2

***** ***** t3

***** ***** t4

***** ***** t5

********** t6• Need at most 3 templates...can we do better?

67

Solution for (20,2,10) w/ shift

.................... Positions

**** ** **** t1

**** * ***** t2

• Optimal is 2 templates...

68

Gapped seed set designSolution strategies

• Randomized algorithms• Greedy algorithm

• Directly to set cover instance• Indirectly to bipartite instance

• Integer programming• On set cover and bipartite instances• Solution of greedy algorithm subproblem• ...in parallel, using COIN-OR SYMPHONY

• Branch-and-bound enumeration• Solution of greedy algorithm subproblem• ...in parallel, using COIN-OR ALPS library

69

What about edit-distance?

• Formulations can be generalized• Similar solution strategies can be applied

• (All) symmetry lost!• This may actually be helpful

• Much harder to solve• Is greedy still good?• Solutions typically require more templates

70

Uniqueness Oracles

• Integrated with CSBH-graph construction algorithm• Ensure edge-count property is preserved

• Sequence database of unique / non-unique 20-mers for small genomes• D. melanogaster, up to edit-distance 2

• Currently working to scale to human...

71

Other Projects / Interests

• HMMs for Peptide Spectrum Matching• with UMd, CS

• Rapid Microorganism Identification Database• www.RMIDb.org

• Pathogen detection using Spectral Matching• with USDA

• Locality sensitive hashing• spectra, peptide sequence

• Statistical techniques• statistical significance • importance sampling

• CSBH-graph applications• genome assembly

• Grid computing• Web-applications• Relational databases

72

Future Research Directions

• Extend k-mer superstring algorithms• Range of word sizes, variable length words• Other sequence properties (Tryptic peptides, Tm)

• Identification of protein isoforms:• Optimize proteomics workflow for isoform detection• Identify splice variants in cancer cell-lines (MCF-7)

and clinical brain tumor samples• Aggressive peptide sequence enumeration• dbPep for genomic annotation• Open, flexible informatics infrastructure for peptide

identification

73

Future Research Directions

• Proteomics for Microorganism Identification• Specificity of tandem mass spectra• Revamp RMIDb prototype• Incorporate spectral matching

• Primer design• Uniqueness oracle for inexact match in human• Integration with Primer3• Tiling, multiplexing, pooling, & tag arrays

74

Acknowledgements

• Chau-Wen Tseng, Xue Wu• UMCP Computer Science

• Catherine Fenselau, Steve Swatkoski• UMCP Biochemistry

• Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: National Cancer Institute

Optimal k -mer superstrings for protein identification and DNA assay design.

Documents

Transcript of Optimal k -mer superstrings for protein identification and DNA assay design.