Algorithms on strings
and other discrete problems in bioinformatics
Zsuzsanna Lipták University of Verona
30 Oct 2013
Zsuzsanna Lipták CBMC 30/10/2013 1
Abstraction: Linear molecules = strings
...AACAGTACCATGCTA...
...TTGTCATGGTACGAT... ...SLDILRRKSLMNYWL...
More Strings
• SNPs for haplotyping
0 1 1 0 0 1 0 0 1 1 0 1 0 2 1 2 0 1
0 1 1 1 0 1 0 0 1 0 0 1 0 2 1 2 0 1 0 2 1 2 0 1
More Strings
• Computational genomics (genome rearrangements, gene clusters)
Characters = genes
Problem: How to get from (1, -7, 6, -10, …) to (1, 2, 3, 4, 5, …, 10) using specified operations
More Strings
• Computational genomics (genome rearrangements, gene clusters)
Problem: Identify maximal substrings appearing together in each genome
Characters = genes
More Strings
• Motif search in protein interaction networks (joint work with F. Cicalese, T. Gagie, E. Giaquinta, E. S. Laber, R. Rizzi, A. Tomescu, 2013)
Jumbled Patterns and Graph motifs
Paths are like strings
Strings are everywhere
Strings (= sequences) are finite sequences over some alphabet Σ
• DNA/RNA (nucleotides): Σ = {A,C,G,T} or {A,C,G,U}
• Protein (amino acids): Σ = {A,R,N,D,C,…}
• SNPs/haplotyping (alleles): Σ = {0,1} or {0,1,2}
• Computational genomics (genes): Σ = {1,2,3,…,n}
• …
Some major problems on strings
• Storage and retrieval (compression, succinct data structures) -> all applications
• Pattern matching (exact, approximate) -> database search, identification, gene mapping, …
• Comparing strings (similarity, distance) -> phylogenetics, sequencing, expression studies, …
• …
Complexity, efficiency!! 1) storage space 2) computation time
Some applications
Things I have worked on/am working on
• Mass spectrometry data interpretation (PhD from Bielefeld University, MS research group)
• Expression clustering (ESTs) (with SANBI, Cape Town, and Wits Univ., Jo'burg, South Africa)
• Non-standard string matching (current) (different collaborations)
• Information extraction from and efficient storage of biological data (current) (with G. Franco, V. Manca)
TOPIC 1: Expression clustering
and string similarity measures
Expression clustering
[Figure: DNA -> (transcription) -> RNA -> (splicing: exon1, exon2, exon3) -> mRNA -> (translation) -> protein]
ESTs are cDNA sequences obtained from mRNAs (lab process)
transcriptome
Expression clustering
Wanted: clusters s.t. each cluster corresponds to a gene
Given: set of sequences (10^5 or more, length 200-800)
Expression clustering
Uses:
• expression studies (e.g. diseased vs. healthy)
• gene discovery
• SNP detection
• discovery of products of alternative splicing
• estimating no. of genes
Technology:
• traditionally: Sanger style sequencing -> ESTs
• recently also: short read sequencing (454, Illumina)
Expression clustering
Problem: Given a set of strings S, find a partition C1, ..., Ck of S s.t. if s1, s2 are “products of the same gene” then s1, s2 are in the same cluster. What does that mean?
Problem: Given a set of strings S, find a partition C1, ..., Ck of S s.t. if s1, s2 are “similar” then s1, s2 are in the same cluster. Which is the correct similarity measure?
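One possible reading of the second formulation is to take the transitive closure of the "similar" relation, so that any two similar strings end up in the same cluster. The sketch below does this with a union-find structure. The function name `cluster` and the predicate `similar` are placeholders introduced here; the slide leaves open which similarity measure to plug in, and the all-pairs loop is only meant for small inputs.

```python
def cluster(strings, similar):
    """Partition strings so that any two similar strings share a cluster.

    `similar` is a caller-supplied predicate (an assumption of this sketch).
    """
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: quadratic in the number of strings.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similar(strings[i], strings[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())
```

Note that the resulting clusters can also contain pairs that are not directly similar, because they are linked through a chain of similar strings.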
String (dis)similarity measures
Example: s1 = CAAGACAA, s2 = CAGAGCAC
• alignment / edit distance (Smith-Waterman 1981)
• q-grams (Ukkonen 1992)
• d2 (Torney et al. 1990)
• others (long exact match, fingerprints, information theory based measures)
alignment vs. non-alignment-based string similarity
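Alignment:
s1: CAAGA-CAA
s2: CA-GAGCAC
edit distance = 3

2-gram counts:
w     freq(w,s1)   freq(w,s2)   |diff|   diff^2
AA    2            0            2        4
AC    1            1            0        0
AG    1            2            1        1
CA    2            2            0        0
GA    1            1            0        0
GC    0            1            1        1
-----------------------------------------------
sum                             4        6

q-gram distance: Σ_w |freq(w,s1) – freq(w,s2)| = 4
d2: Σ_w (freq(w,s1) – freq(w,s2))^2 = 6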
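The two subword-based values in this example can be recomputed with a few lines of Python. This is a minimal sketch: the function names are chosen here for illustration, and both distances are taken over the whole strings (the windowed variant of d2 used for EST clustering is not shown).

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every length-q substring (q-gram) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(s1, s2, q):
    """Sum of absolute differences of q-gram counts."""
    p1, p2 = qgram_profile(s1, q), qgram_profile(s2, q)
    return sum(abs(p1[w] - p2[w]) for w in p1.keys() | p2.keys())


def d2_distance(s1, s2, q):
    """Sum of squared differences of q-gram counts."""
    p1, p2 = qgram_profile(s1, q), qgram_profile(s2, q)
    return sum((p1[w] - p2[w]) ** 2 for w in p1.keys() | p2.keys())


print(qgram_distance("CAAGACAA", "CAGAGCAC", 2))  # 4
print(d2_distance("CAAGACAA", "CAGAGCAC", 2))     # 6
```

Both functions make one pass over each string to build the profiles, which matches the linear running time claimed for subword-based measures.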
Subword-based string similarity measures
What’s wrong with alignment? It’s too slow!
• d2 has been shown to be well suited for EST clustering (Torney et al. 1990, Hide et al. 1999, Hazelhurst et al. 2008)
• Claims that d2 is better suited than alignment.

                                 alignment   subwords (d2, q-grams)
two strings of length n          O(n^2)      O(n)
min over all pairs of windows    O(n^3)      O(n^2)
Contributions (expression clustering)
Joint work with Scott Hazelhurst and others.
• ECLEST: a tool for rigorous comparison of string similarity measures for EST clustering (master's thesis J. Zimmermann) (IEEE-BIBE, 2004)
• WCD (wicked!): EST clustering tool using d2 (Bioinformatics, 2008)
• KABOOM! (WCD6): new expression clustering tool
  – both Sanger style and short read sequences
  – uses suffix arrays
  – Bioinformatics, 2011
  – see next slide
KABOOM!
• Using subword-based similarity, the problem is still computation time!
• based on pairwise comparison, Θ(m^2 n^2) time is unacceptable! (remember: m = 10^5, 10^6)
• Need: heuristics to get rid of m^2
KABOOM! multiple exact matches
• KABOOM heuristic: s1 and s2 must have at least A many k-long exact matches at least B apart
• implementation with a modified suffix array of the concatenated string s1xs2xs3x...xsm of size O(nm). Reduces no. of comparisons to 5% or less! We pay in storage space, but still acceptable.
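To make the filter condition concrete, here is a rough sketch for a single pair of strings. It is not the KABOOM implementation: it uses a plain hash set of k-mers instead of a modified suffix array, and it assumes that "at least B apart" refers to the start positions of the matches in s1. The name `passes_filter` is made up for this example; A, B and k are the parameters from the slide.

```python
def passes_filter(s1, s2, k, A, B):
    """Return True if s1 has at least A k-mers that also occur in s2,
    with start positions in s1 at least B apart (assumed reading)."""
    kmers_s2 = {s2[j:j + k] for j in range(len(s2) - k + 1)}
    count = 0
    last = None  # start position in s1 of the last counted match
    for i in range(len(s1) - k + 1):
        if s1[i:i + k] in kmers_s2 and (last is None or i - last >= B):
            count += 1
            last = i
            if count >= A:
                return True
    return count >= A
```

Only pairs that pass such a cheap test would then go on to the full d2 comparison; avoiding the explicit loop over all pairs is the job of the suffix array over the concatenated string.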
TOPIC 2: Mass spectrometry
and the Money Changing Problem
[Figure: unknown molecular mixture (DNA, protein, metabolites, …) -> mass spectrum (intensity vs. mass) -> identification (names, sequences such as AARLSTRACLSAAIS…, database identifiers, molecular structure, …)]
Mass spectrometry
Mass spectrometry
• input: unknown molecular mixture (sample)
• output: list of masses of the sample molecules (mass in Da, intensity); actually: m/z = mass over charge
• intensity proportional to abundance: how often that mass was measured (ideally!)
[Figure: mass spectrum, intensity vs. mass]
Mass decomposition
• Given: query mass M (in Da)
• Known: What type of molecules are in the sample? (DNA, protein, ...)
• Question: What molecules can have this mass?
[Figure: mass spectrum, intensity vs. mass]
amino acids (k = 20)     nucleotides (k = 4)     “bioatoms” (k = 6)
Ala (A)  71.079 Da       A  313.058              C  12.0      O  15.995
Arg (R)  156.101         C  289.046              H  1.008     P  30.974
Asn (N)  114.043         G  329.053              N  14.003    S  31.972
...                      T  304.046
Mass decomposition
NB:
• output will be: composition/sum formula (not: sequence or molecular structure!)
  EGAEEYSSFL ⇒ A1E3G1L1F1S2Y1
  AACGTAGGAA ⇒ A5C1G3T1
  ⇒ C10H16N5O13P3 (sum formula)
• take into account error (measurement, computation)
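The step from a sequence to its composition just counts characters and forgets their order. A minimal sketch, with a function name chosen here and the characters printed in alphabetical order (the slide lists them in a different order):

```python
from collections import Counter


def composition(seq):
    """Return the composition of seq as a string such as 'A1E3F1G1L1S2Y1'."""
    counts = Counter(seq)
    return "".join(f"{ch}{counts[ch]}" for ch in sorted(counts))


print(composition("EGAEEYSSFL"))  # A1E3F1G1L1S2Y1
```

Every permutation of a sequence maps to the same composition, which is why a mass alone cannot determine the sequence.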
Mass decomposition – One application
• A professor at the Institute of Pathophysiology at a German university wants to patent a pentapeptide (5 AAs)
• molecular mass = 609.2 Da
• Patent judge says: from mass follows sequence
• Question: Is this true?
• Given:
  • k coin denominations
  • number M
• Question: How can we make change for M?
The Money Changing Problem (MCP)
Coins: 4, 5, 7
19 = 3·5 + 1·4 = 3·4 + 1·7 = 2·7 + 1·5
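The example can be checked by brute-force enumeration. The sketch below lists all ways to make change for M; it is a plain recursive search written for illustration, not the algorithm from the publications cited in this talk, and the function name is an assumption.

```python
def decompositions(coins, M):
    """All tuples (c_1, ..., c_k) of non-negative integers
    with c_1*coins[0] + ... + c_k*coins[k-1] == M."""
    results = []

    def extend(i, remaining, counts):
        if i == len(coins):
            if remaining == 0:
                results.append(tuple(counts))
            return
        # Try every feasible multiplicity of coin i.
        for c in range(remaining // coins[i] + 1):
            extend(i + 1, remaining - c * coins[i], counts + [c])

    extend(0, M, [])
    return results


print(decompositions([4, 5, 7], 19))
# [(0, 1, 2), (1, 3, 0), (3, 0, 1)]
```

The three tuples give the multiplicities of the coins 4, 5 and 7, so they are exactly the three ways of changing 19 shown above.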
Translating the MS problem into MCP
Ala (A)   71.079 Da   ->   7108
Arg (R)   156.101     ->   15610
Asn (N)   114.043     ->   11404
...                        etc.
precision = 0.01
Given: query M, error bound ε. Compute all decompositions of masses between M - ε and M + ε (scaled to integers with factor 1/precision).
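A small sketch of this translation, restricted to the three amino acid masses listed here. The function name, the dictionary layout and the query values are assumptions made for the example. The sketch also ignores that rounding errors add up over the residues, which a real tool has to account for.

```python
def decompose_mass(masses, M, eps, precision=0.01):
    """Decompose every scaled integer mass in [M - eps, M + eps]."""
    names = list(masses)
    # Scale each real-valued mass to an integer "coin", e.g. 71.079 -> 7108.
    coins = [round(masses[name] / precision) for name in names]
    lo = round((M - eps) / precision)
    hi = round((M + eps) / precision)
    found = []

    def extend(i, remaining, counts):
        if i == len(coins):
            if remaining == 0:
                found.append(dict(zip(names, counts)))
            return
        for c in range(remaining // coins[i] + 1):
            extend(i + 1, remaining - c * coins[i], counts + [c])

    for target in range(lo, hi + 1):
        extend(0, target, [])
    return found


masses = {"Ala": 71.079, "Arg": 156.101, "Asn": 114.043}
print(decompose_mass(masses, 298.26, 0.02))
# [{'Ala': 2, 'Arg': 1, 'Asn': 0}]
```

With precision 0.01 the query window becomes the integers 29824 to 29828, and 2·7108 + 15610 = 29826 is the only combination of the three scaled masses that falls inside it.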
Contributions (MS and MCP)
Joint work with Sebastian Böcker and others.
• a new algorithm for the Money Changing Problem (COCOON 2005, ACM-SAC 2005, Algorithmica 2007)
• DECOMP: tool implementing this algorithm (Bioinformatics, 2008) http://bibiserv.techfak.uni-bielefeld.de/decomp/
• identification of metabolites using isotopic info (WABI 2006)
• SIRIUS: tool for metabolite identification (Bioinformatics 2009)
• AUDENS: a tool for tandem MS spectra (Proteomics 2005)
Back to the pentapeptide story
• A professor at the Institute of Pathophysiology at a German university wants to patent a pentapeptide (5 AAs)
• molecular mass = 609.2 Da
• Patent judge says: from mass follows sequence
• Question: Is this true?
Question is, do we agree with patenting peptides? ...
• no! e.g. QYNAD (120 permutations), AENNY (60), EFNNS (60), DGQQY (60), ... = 7 sum formulas (decompositions) and 600 sequences
• if length not fixed: 16 decompositions and 3150 sequences
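The permutation counts quoted for the individual compositions are multinomial coefficients: the number of distinct orderings of a multiset of amino acids. A short sketch (the function name is chosen here for illustration):

```python
from collections import Counter
from math import factorial


def num_sequences(composition):
    """Number of distinct sequences with the given multiset of characters."""
    total = factorial(len(composition))
    for multiplicity in Counter(composition).values():
        total //= factorial(multiplicity)
    return total


print(num_sequences("QYNAD"))  # 120 = 5!
print(num_sequences("AENNY"))  # 60 = 5! / 2!
```

Five distinct residues give 5! = 120 sequences, while a composition with one repeated residue gives 5!/2! = 60.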
TOPIC 3: Non-standard string matching
MS: Peptide Mass Fingerprinting (PMF)
[Figure: unknown protein -> (digestion) -> peptides (fragments) -> (MS experiment) -> list of masses: 185.08, 234.67, 498.08, 515.67, 556.09, 677.98 -> (database lookup) -> matching protein?]
Database sequences: RLRKRLSSLNEAPPILTPTQ... SPSEHQDSPRIVGGLVGAG... VQPVCLPRSVASSAEPEG.. AAFTPGMLCAGFLEGGTD... LRGIVSWGSGGAHPYIAALY
Weighted string matching
Σ finite alphabet, w: Σ -> ℕ weight (mass) function. Weight of a string is the sum of the weights of its characters: w(t1...tm) = w(t1) + ... + w(tm).
Given text s and a mass M, find all substrings of s with mass M.
Example: w(a) = 4, w(b) = 5, w(c) = 7, M = 31.
b b a c a c c a b a b b a a c c a b a c
Compositions of mass 31: a^4 b^3, a^3 b c^2, a b^4 c, a^6 c, b^2 c^3 (exponents are multiplicities)
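Because all weights are positive, a single text can be searched with a simple sliding window: extend the window to the right, and shrink it from the left while it is too heavy. The sketch below is this textbook linear scan, given only for illustration; it is not one of the algorithms from the cited papers, and the function name is an assumption.

```python
def substrings_with_mass(text, weight, M):
    """Return all (start, end) pairs with sum of weights of text[start:end] == M.

    Assumes every weight is a positive integer and M >= 1.
    """
    hits = []
    left = 0
    total = 0
    for right, ch in enumerate(text):
        total += weight[ch]
        # Shrink from the left while the window is too heavy.
        while total > M and left <= right:
            total -= weight[text[left]]
            left += 1
        if total == M and left <= right:
            hits.append((left, right + 1))
    return hits


text = "bbacaccababbaaccabac"
weight = {"a": 4, "b": 5, "c": 7}
for start, end in substrings_with_mass(text, weight, 31):
    print(start, text[start:end])
```

Each position enters and leaves the window at most once, so the scan is linear in the length of the text.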
Jumbled string matching
b b a c a c c a b a b b a a c c a b a c
Σ finite alphabet. The Parikh vector of a string t over Σ counts the multiplicities of its characters, e.g. p(accaba) = (3,1,2).
Given text s and a Parikh vector q, find all substrings of s with Parikh vector q.
Example: q = (3,1,2) – find all jumbled (permuted) versions of accaba.
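For a single query, all matches have the same length (the sum of the entries of q), so a fixed-length sliding window that maintains character counts is enough. A minimal sketch follows; for convenience the Parikh vector is passed as an example string with the desired character counts, which is an assumption of this code rather than part of the problem statement.

```python
from collections import Counter


def jumbled_matches(text, pattern):
    """Start positions of all substrings of text that are permutations of pattern."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    target = Counter(pattern)
    window = Counter(text[:m])
    hits = [0] if window == target else []
    for i in range(m, len(text)):
        incoming, outgoing = text[i], text[i - m]
        window[incoming] += 1
        window[outgoing] -= 1
        if window[outgoing] == 0:
            del window[outgoing]  # keep the counter free of zero entries
        if window == target:
            hits.append(i - m + 1)
    return hits


print(jumbled_matches("bbacaccababbaaccabac", "accaba"))
# [4, 11, 12, 13]
```

On the example text the matches start at positions 4, 11, 12 and 13 (0-based): accaba, baacca, aaccab and accaba.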
Jumbled string matching
Some applications:
• mass spectrometry
• gene clusters
• pattern discovery
• filter for exact string matching
• motif search in protein interaction networks
b b a c a c c a b a b b a a c c a b a c
Contributions (non-standard string matching)
Joint work with …
• algorithms for weighted string matching (IFIP-TCS 2001, COSSAC 2001, Discr. Applied Math. 2004, CPM 2005, Discr. Applied Math. 2007)
• algorithms for jumbled string matching (PSC 2009, FUN 2010, Int. J. Found. of Comp. Sc. 2012, Theory of Comp. Systems 2012, IPL 2013, SPIRE 2013)
• string reconstruction (IWOCA 2010, J. Discr. Alg. 2012)
• language of prefix normal words (DLT 2011)
Summary
• String problems arising in bioinformatics (in general: discrete, combinatorial problems)
• Many different types of strings in bioinformatics
• Efficient storage, efficient algorithms matter
• Many interesting and unsolved problems