
Algorithms on strings

and other discrete problems in bioinformatics

Zsuzsanna Lipták University of Verona

30 Oct 2013

Zsuzsanna Lipták CBMC 30/10/2013 1

Abstraction: Linear molecules = strings


...AACAGTACCATGCTA...

...TTGTCATGGTACGAT... ...SLDILRRKSLMNYWL...

More Strings

•  SNPs for haplotyping


0 1 1 0 0 1 0 0 1 1 0 1 0 2 1 2 0 1

0 1 1 1 0 1 0 0 1 0 0 1 0 2 1 2 0 1 0 2 1 2 0 1

More Strings

•  Computational genomics (genome rearrangements, gene clusters)


Characters = genes

Problem: How to get from (1,-7,6,-10, …) to (1,2,3,4,5,…,10) using specified operations

More Strings

•  Computational genomics (genome rearrangements, gene clusters)


Problem: Identify maximal substrings appearing together in each genome

Characters = genes

More Strings

•  Motif search in protein interaction networks (joint work with F. Cicalese, T. Gagie, E. Giaquinta, E.S. Laber, R. Rizzi, A. Tomescu, 2013)


Jumbled Patterns and Graph motifs

Paths are like strings

Strings are everywhere

Strings (= sequences) are finite sequences over some alphabet Σ

•  DNA/RNA (nucleotides): Σ = {A,C,G,T} or {A,C,G,U}

•  Protein (amino acids): Σ = {A,R,N,D,C,…}

•  SNPs/haplotyping (alleles): Σ = {0,1} or {0,1,2}

•  computational genomics (genes): Σ = {1,2,3,…,n}

•  …


Some major problems on strings

•  Storage and retrieval (compression, succinct data structures) -> all applications

•  Pattern matching (exact, approximate) -> database search, identification, gene mapping, …

•  Comparing strings (similarity, distance) -> phylogenetics, sequencing, expression studies, …

•  …


Complexity, efficiency!! 1) storage space 2) computation time

Some applications

Things I have worked on/am working on

•  Mass spectrometry data interpretation (PhD from Bielefeld University, MS research group)

•  Expression clustering (ESTs) (with SANBI, Cape Town, and Wits Univ. Jo’burg, South Africa)

•  Non-standard string matching (current) (different collaborations)

•  Information extraction from and efficient storage of biological data (current) (with G. Franco, V. Manca)



TOPIC 1: Expression clustering

and string similarity measures

Expression clustering


DNA

RNA

mRNA

protein

transcription

translation

splicing

exon3 exon2 exon1

ESTs are cDNA sequences obtained from mRNAs (lab process)

transcriptome



Wanted: clusters s.t. each cluster corresponds to a gene

Given: set of sequences (10^5 or more, length 200-800)


Uses:

•  expression studies (e.g. diseased vs. healthy)

•  gene discovery

•  SNP detection

•  discovery of products of alternative splicing

•  estimating no. of genes

Technology:

•  traditionally: Sanger style sequencing -> ESTs

•  recently also: short read sequencing (454, Illumina)


Problem: Given a set of strings S, find a partition C1,...,Ck of S s.t. if s1,s2 are “products of the same gene” then s1,s2 are in the same cluster. What does that mean?

Problem: Given a set of strings S, find a partition C1,...,Ck of S s.t. if s1,s2 are “similar” then s1,s2 are in the same cluster. Which is the correct similarity measure?
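One way to read the second formulation, assumed here purely for illustration, is transitively: two strings land in the same cluster whenever a chain of pairwise "similar" strings connects them. The sketch below computes such clusters with a small union-find structure. The names `cluster` and `shares_kmer`, and the "share an exact 4-mer" criterion, are hypothetical stand-ins and not taken from the talk.

```python
def shares_kmer(s1, s2, k=4):
    """Hypothetical similarity test: do s1 and s2 share an exact k-mer?"""
    kmers = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    return any(s2[j:j + k] in kmers for j in range(len(s2) - k + 1))


def cluster(strings, similar=shares_kmer):
    """Partition `strings` into connected components of the 'similar' relation."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: quadratic in the number of strings.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similar(strings[i], strings[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return sorted(groups.values())


print(cluster(["ACGTAC", "GTACGG", "TTTTTT", "CGGAAA"]))
```

The double loop compares every pair of strings, which is exactly the quadratic cost in the number of sequences that becomes a problem for sets of 10^5 or more.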


String (dis)similarity measures

Example: s1 = CAAGACAA, s2 = CAGAGCAC

•  alignment / edit distance (Smith-Waterman 1981)

•  q-grams (Ukkonen 1992)

•  d2 (Torney et al 1990)

•  others (long exact match, fingerprints, information theory based measures)

alignment vs. non-alignment-based string similarity


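Alignment (edit distance = 3):

s1: CAAGA-CAA
s2: CA-GAGCAC

2-gram counts:

w     freq(w,s1)  freq(w,s2)  |diff|  diff^2
AA    2           0           2       4
AC    1           1           0       0
AG    1           2           1       1
CA    2           2           0       0
GA    1           1           0       0
GC    0           1           1       1
----------------------------------------------
sum                           4       6

q-gram distance (q = 2): Σw |freq(w,s1) – freq(w,s2)| = 4

d2 distance: Σw (freq(w,s1) – freq(w,s2))^2 = 6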
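The two sums over 2-gram frequencies can be checked with a few lines of Python. This is a minimal sketch that compares the two whole strings; the function names are made up for illustration, and the windowed variant of d2 used in practice is not modelled here.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every length-q substring (q-gram) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(s1, s2, q):
    """Sum over all q-grams w of |freq(w,s1) - freq(w,s2)|."""
    p1, p2 = qgram_profile(s1, q), qgram_profile(s2, q)
    return sum(abs(p1[w] - p2[w]) for w in p1.keys() | p2.keys())


def d2_distance(s1, s2, q):
    """Sum over all q-grams w of (freq(w,s1) - freq(w,s2))^2."""
    p1, p2 = qgram_profile(s1, q), qgram_profile(s2, q)
    return sum((p1[w] - p2[w]) ** 2 for w in p1.keys() | p2.keys())


print(qgram_distance("CAAGACAA", "CAGAGCAC", 2))  # 4
print(d2_distance("CAAGACAA", "CAGAGCAC", 2))     # 6
```

Both functions make one pass over each string to build the profiles, which is where the linear running time of subword-based measures comes from.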

Subword-based string similarity measures

What’s wrong with alignment? It’s too slow!

•  d2 has been shown to be well suited for EST clustering (Torney et al 90, Hide et al 99, Hazelhurst et al 2008)

•  Claims that d2 is better suited than alignment.


                                 alignment   subwords (d2, q-grams)
two strings of length n          O(n^2)      O(n)
min over all pairs of windows    O(n^3)      O(n^2)
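The quadratic cost of alignment for two strings comes from filling a table with one cell per pair of positions. The sketch below computes plain unit-cost edit distance this way; it is a textbook dynamic program used here as an assumption-free stand-in, not the Smith-Waterman local alignment cited earlier, and the function name is illustrative.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance via the standard O(len(s1) * len(s2)) table."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("CAAGACAA", "CAGAGCAC"))  # 3
```

Only two rows of the table are kept at a time, so the space is linear even though the time stays quadratic.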

Contributions (expression clustering)

Joint work with Scott Hazelhurst and others.

•  ECLEST: a tool for rigorous comparison of string similarity measures for EST clustering (master's thesis J. Zimmermann) (IEEE-BIBE, 2004)

•  WCD (wicked!): EST clustering tool using d2 (Bioinformatics, 2008)

•  KABOOM! (WCD6): new expression clustering tool
   –  both Sanger style and short read sequences
   –  uses suffix arrays
   –  Bioinformatics, 2011 – see next slide


KABOOM!

•  Using subword based similarity, problem still: computation time!

•  based on pairwise comparison, Θ(m^2 n^2) time unacceptable! (remember: m = 10^5, 10^6)

•  Need: heuristics to get rid of m^2


KABOOM! multiple exact matches

•  KABOOM heuristic: s1 and s2 must have at least A many k-long exact matches at least B apart

•  implementation with a modified suffix array of the concatenated string s1xs2xs3x...xsm of size O(nm). Reduces no. of comparisons to 5% or less! We pay in storage space but still acceptable.
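To make the filter idea concrete, here is a rough sketch of a pairwise test in its spirit. It is not the KABOOM implementation: it uses a plain Python set of k-mers instead of a modified suffix array, and it assumes that "at least B apart" refers to the start positions of the matches in s1, which the slide does not spell out. The names `shared_kmer_positions` and `passes_filter` are hypothetical.

```python
def shared_kmer_positions(s1, s2, k):
    """Start positions in s1 of k-mers that also occur somewhere in s2."""
    kmers2 = {s2[i:i + k] for i in range(len(s2) - k + 1)}
    return [i for i in range(len(s1) - k + 1) if s1[i:i + k] in kmers2]


def passes_filter(s1, s2, k, a, b):
    """True if s1 has at least `a` shared k-mers whose starts are >= b apart.

    Assumption: distances are measured between start positions in s1 only.
    """
    count, last = 0, None
    for pos in shared_kmer_positions(s1, s2, k):
        # Greedily keep the leftmost match that is far enough from the last kept one.
        if last is None or pos - last >= b:
            count += 1
            last = pos
    return count >= a


print(passes_filter("AAACCCGGGTTT", "TTTGGGAAA", k=3, a=3, b=3))  # True
print(passes_filter("AAACCCGGGTTT", "TTTGGGAAA", k=3, a=3, b=4))  # False
```

Only pairs that pass such a cheap test would then be handed to the more expensive similarity computation.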



TOPIC 2: Mass spectrometry

and the Money Changing Problem


Mass spectrometry

unknown molecular mixture (DNA, protein, metabolites…) -> mass spectrum -> identification (names, sequences, database identifiers, molecular structure…), e.g. AARLSTRACLSAAIS…, LSDESMFGHEESLR…, SRILSRLELPSGILGG…, QEKLHGEERALPSK…, ECDNRAALIGRSEDV…, …

[Figure: mass spectrum, intensity vs. mass]


Mass spectrometry

•  input: unknown molecular mixture (sample)

•  output: list of masses of the sample molecules (mass in Da, intensity); actually: m/z = mass over charge

•  intensity proportional to abundance: how often that mass was measured (ideally!)

[Figure: mass spectrum, intensity vs. mass]


Mass decomposition

•  Given: query mass M (in Da)

•  Known: What type of molecules are in sample? (DNA, protein, ... )

•  Question: What molecules can have this mass?

[Figure: mass spectrum, intensity vs. mass]

amino acids (k = 20)     nucleotides (k = 4)     “bioatoms” (k = 6)
Ala (A)   71.079 Da      A  313.058              C  12.0      O  15.995
Arg (R)  156.101         C  289.046              H   1.008    P  30.974
Asn (N)  114.043         G  329.053              N  14.003    S  31.972
...                      T  304.046


Mass decomposition

NB:

•  output will be: composition/sum formula (not: sequence or molecular structure!)

   EGAEEYSSFL ⇒ A1E3G1L1F1S2Y1
   AACGTAGGAA ⇒ A5C1G3T1
   [molecular structure] ⇒ C10H16N5O13P3 (sum formula)

•  take into account error (measurement, computation)
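Turning a sequence into its composition just means counting characters and forgetting their order. A minimal sketch follows; the function name and the alphabetical output order are choices made here, not taken from the slides.

```python
from collections import Counter


def composition(sequence):
    """Character multiplicities of a sequence, e.g. 'AAB' -> 'A2B1'."""
    counts = Counter(sequence)
    return "".join(f"{ch}{counts[ch]}" for ch in sorted(counts))


print(composition("EGAEEYSSFL"))
print(composition("AACGTAGGAA"))
```

Any two sequences that are permutations of each other map to the same composition, which is why a mass alone can at best identify the composition.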


Mass decomposition – One application

•  A professor at the Institute of Pathophysiology at a German university wants to patent a pentapeptide (5 AAs)

•  molecular mass = 609.2 Da

•  Patent judge says: from mass follows sequence

•  Question: Is this true?


The Money Changing Problem (MCP)

•  Given:
   –  k coin denominations
   –  number M

•  Question: How can we make change for M?

Example: coins 4, 5, 7 and M = 19:

19 = 3·5 + 1·4 = 3·4 + 1·7 = 2·7 + 1·5
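For small instances the question can be answered by brute force. The sketch below enumerates every way to make change by trying each feasible multiplicity of each coin in turn. It is a naive enumeration written for illustration, not the algorithm developed in the cited work, and it assumes all denominations are positive integers.

```python
def decompositions(coins, total):
    """Yield every tuple of non-negative counts c with sum(c[i] * coins[i]) == total."""
    def extend(i, remaining, counts):
        if i == len(coins):
            if remaining == 0:
                yield tuple(counts)
            return
        # Try every feasible multiplicity of coin i, then recurse on the rest.
        for n in range(remaining // coins[i] + 1):
            yield from extend(i + 1, remaining - n * coins[i], counts + [n])

    yield from extend(0, total, [])


for counts in decompositions([4, 5, 7], 19):
    print(counts)
```

With coins 4, 5, 7 and M = 19 this finds exactly three count vectors, (0, 1, 2), (1, 3, 0) and (3, 0, 1), one for each way of writing 19 as a sum of these coins.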


Translating the MS problem into MCP

Ala (A)   71.079 Da   ->    7108
Arg (R)  156.101      ->   15610
Asn (N)  114.043      ->   11404
...                        etc.

precision = 0.01

Given: query M, error bound ε. Compute all decompositions of masses between M - ε and M + ε (scaled to integers with factor 1/precision).
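The translation can be sketched in a few lines: scale every mass to an integer, scale the query interval the same way, and enumerate all count vectors whose integer mass falls inside it. The code below is a naive illustration restricted to the three residues listed above; `scale` and `decompose_mass` are hypothetical names, and the enumeration is brute force rather than the algorithm from the cited papers.

```python
def scale(mass, precision=0.01):
    """Map a real-valued mass to an integer number of precision units."""
    return round(mass / precision)


def decompose_mass(masses, query, eps, precision=0.01):
    """Return all count tuples whose scaled mass lies in [query - eps, query + eps]."""
    coins = [scale(m, precision) for m in masses]
    lo, hi = scale(query - eps, precision), scale(query + eps, precision)
    results = []

    def extend(i, total, counts):
        if i == len(coins):
            if lo <= total <= hi:
                results.append(tuple(counts))
            return
        n = 0
        # Add copies of character i until the upper bound is exceeded.
        while total + n * coins[i] <= hi:
            extend(i + 1, total + n * coins[i], counts + [n])
            n += 1

    extend(0, 0, [])
    return results


masses = [71.079, 156.101, 114.043]  # Ala, Arg, Asn as listed above
print([scale(m) for m in masses])    # [7108, 15610, 11404]
print(decompose_mass(masses, 298.26, 0.02))  # [(2, 1, 0)]
```

Rounding each mass separately introduces an error of up to half a precision unit per character, so a careful implementation has to account for this when choosing the integer interval.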

Contributions (MS and MCP)

Joint work with Sebastian Böcker and others.

•  a new algorithm for the Money Changing Problem (COCOON 2005, ACM-SAC 2005, Algorithmica 2007)

•  DECOMP: tool implementing this algorithm (Bioinformatics, 2008) http://bibiserv.techfak.uni-bielefeld.de/decomp/

•  identification of metabolites using isotopic info (WABI 2006)

•  SIRIUS: tool for metabolite identification (Bioinformatics 2009)

•  AUDENS: a tool for tandem MS spectra (Proteomics 2005)



Back to the pentapeptide story

•  A professor at the Institute of Pathophysiology at a German university wants to patent a pentapeptide (5 AAs)

•  molecular mass = 609.2 Da

•  Patent judge says: from mass follows sequence

•  Question: Is this true?

Question is, do we agree with patenting peptides? ...

•  no! e.g. QYNAD (120 permutations), AENNY (60), EFNNS (60), DGQQY (60), ... = 7 sum formulas (decompositions) and 600 sequences

•  if length not fixed: 16 decompositions and 3150 sequences
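The permutation counts quoted for the example peptides follow from the multinomial formula: the number of distinct orderings of a composition is n! divided by the factorial of each character's multiplicity. A short check, with an illustrative function name:

```python
from collections import Counter
from math import factorial


def num_sequences(composition):
    """Number of distinct orderings of the characters in `composition`."""
    total = factorial(len(composition))
    for count in Counter(composition).values():
        total //= factorial(count)
    return total


for peptide in ["QYNAD", "AENNY", "EFNNS", "DGQQY"]:
    print(peptide, num_sequences(peptide))
```

QYNAD has five distinct residues and therefore 5! = 120 orderings, while each of the other three contains one repeated residue and has 5!/2! = 60.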


TOPIC 3: Non-standard string matching


MS: Peptide Mass Fingerprinting (PMF)

? unknown protein

peptides (fragments)

digestion

list of masses 185.08, 234.67, 498.08, 515.67, 556.09, 677.98

MS experiment

RLRKRLSSLNEAPPILTPTQ... SPSEHQDSPRIVGGLVGAG... VQPVCLPRSVASSAEPEG.. AAFTPGMLCAGFLEGGTD... LRGIVSWGSGGAHPYIAALY

matching protein?

database lookup


Weighted string matching

Σ finite alphabet, w: Σ -> ℕ weight (mass) function. The weight of a string is the sum of the weights of its chars: w(t1...tm) = w(t1) + ... + w(tm).

Given text s and a mass M, find all substrings of s with mass M.

Example: w(a) = 4, w(b) = 5, w(c) = 7, M = 31.

b b a c a c c a b a b b a a c c a b a c

31 = 4a + 3b = 3a + b + 2c = a + 4b + c = 6a + c = 2b + 3c (each letter standing for its weight)
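Because all weights are positive, the substrings of a given mass can be found with a simple two-pointer sweep over the text. The sketch below is a baseline written for illustration, not one of the algorithms from the cited papers; the function name is hypothetical, and a positive target mass is assumed.

```python
def weighted_matches(text, weight, target):
    """Return (start, end) pairs with sum of weights of text[start:end] == target.

    Assumes all weights are positive, so a two-pointer sweep is enough.
    """
    matches = []
    start, mass = 0, 0
    for end, ch in enumerate(text, start=1):
        mass += weight[ch]            # extend the window to the right
        while mass > target:          # shrink from the left while too heavy
            mass -= weight[text[start]]
            start += 1
        if mass == target and start < end:
            matches.append((start, end))
    return matches


text = "bbacaccababbaaccabac"
print(weighted_matches(text, {"a": 4, "b": 5, "c": 7}, 31))
```

On the example text this reports five occurrences, at the half-open index ranges (4, 10), (7, 14), (11, 17), (12, 18) and (13, 19); the second one, "ababbaa", has composition 4a + 3b, and the other four have composition 3a + b + 2c.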


Jumbled string matching


Σ finite alphabet; the Parikh vector of a string t over Σ counts the multiplicities of its chars, e.g. p(accaba) = (3,1,2).

Given text s and a Parikh vector q, find all substrings of s with Parikh vector q.

Example: q = (3,1,2) – find all jumbled (permuted) versions of accaba.
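For a single query, jumbled matching can be done with a fixed-length sliding window that maintains the character counts of the current substring. The following sketch is such a baseline, not one of the indexing algorithms from the cited papers. The function name is hypothetical, and the text is the one from the weighted matching example, reused here as an assumption.

```python
from collections import Counter


def jumbled_matches(text, pattern):
    """Start positions of substrings of text that are permutations of pattern."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    need = Counter(pattern)
    window = Counter(text[:m])
    starts = [0] if window == need else []
    for i in range(m, len(text)):
        window[text[i]] += 1          # character entering the window
        left = text[i - m]            # character leaving the window
        window[left] -= 1
        if window[left] == 0:
            del window[left]          # keep the Counter free of zero entries
        if window == need:
            starts.append(i - m + 1)
    return starts


print(jumbled_matches("bbacaccababbaaccabac", "accaba"))  # [4, 11, 12, 13]
```

Every window of length 6 whose counts equal (3,1,2) is reported, so "accaba", "baacca" and "aaccab" all count as occurrences.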

Jumbled string matching

Some applications:

•  mass spectrometry

•  gene clusters

•  pattern discovery

•  filter for exact string matching

•  motif search in protein interaction networks



Contributions (non-standard string matching)

Joint work with …

•  algorithms for weighted string matching (IFIP-TCS 2001, COSSAC 2001, Discr Applied Math 2004, CPM 2005, Discr Applied Math 2007)

•  algorithms for jumbled string matching (PSC 2009, FUN 2010, Int. J Found. of Comp. Sc. 2012, Theory of Comp. Systems 2012, IPL 2013, SPIRE 2013)

•  string reconstruction (IWOCA 2010, J Discr Alg 2012)

•  language of prefix normal words (DLT 2011)


Summary

•  String problems arising in bioinformatics (in general: discrete, combinatorial problems)

•  Many different types of strings in bioinformatics

•  Efficient storage, efficient algorithms matter

•  Many interesting and unsolved problems




GRAZIE!

[email protected]
