Algorithms in Mass Spectrometry


  • 7/31/2019 Algorithms in Mass Spectrometry


    Algorithmen in der Massenspektrometrie-basierten Proteomforschung

    Algorithms in mass spectrometry

    based proteomics

    Dr. Clemens Gröpl

    3 Lectures (28.1., 2.2., 4.2.)

    VL Algorithmische Bioinformatik, Winter Semester 2008/09

    Free University Berlin, http://www.inf.fu-berlin.de/inst/ag-bio/


    Outline

    1. Mass spectrometry (and liquid chromatography) based methods in proteomics.

    Introduction to experimental methods and important concepts.

    2. Initial data processing of mass spectra and LC-MS data

    Isotopic distributions

    Mass decomposition

    Baseline filtering

    Noise filtering

    Wavelets

    Peak picking in MS and feature finding in LC-MS

    3. Analysis of LC-MS/MS data, identification of peptides and proteins


    Isotope distributions

    This exposition is based on:

    R. Martin Smith: Understanding Mass Spectra. A Basic Approach. Wiley, 2nd edition 2004. [S04]

    Exact masses and isotopic abundances can be found for example at http://www.sisweb.com/referenc/source/exactmaa.htm or http://education.expasy.org/student_projects/isotopident/htdocs/motza.html

    IUPAC Compendium of Chemical Terminology - the Gold Book. http://goldbook.iupac.org/ [GoldBook]

    Sebastian Böcker, Zsuzsanna Lipták: Efficient Mass Decomposition. ACM Symposium on Applied Computing, 2005. [BL05]

    Christian Huber, lectures given at Saarland University, 2005. [H05]

    Wikipedia: http://en.wikipedia.org/, http://de.wikipedia.org/


    Isotopes

    This lecture addresses some of the more combinatorial aspects of mass spectrometry related to isotope distributions and mass decomposition.

    Most elements occur in nature as a mixture of isotopes. Isotopes are atom species of the same chemical element that have different masses. They have the same number of protons and electrons, but a different number of neutrons. The main elements occurring in proteins are C, H, N, O, P, and S (CHNOPS). A list of their naturally occurring isotopes is given below.

    Isotope   Mass [Da]       % Abundance
    1H         1.007825       99.985
    2H         2.014102        0.015
    12C       12 (exact)      98.90
    13C       13.003355        1.10
    14N       14.003074       99.63
    15N       15.000109        0.37
    16O       15.994915       99.76
    17O       16.999131        0.038
    18O       17.999159        0.20
    31P       30.973763      100
    32S       31.972072       95.02
    33S       32.971459        0.75
    34S       33.967868        4.21


    Isotopes (3)

    The two isotopes of hydrogen have special names: 1H is called protium, and 2H = D is called deuterium (or sometimes heavy hydrogen).

    Note that whereas the exact masses are universal physical constants, the relative

    abundances are different at each place on earth and can in fact be used to trace the

    origin of substances. They are also being used in isotopic labeling techniques.

    The standard unit of mass, the unified atomic mass unit, is defined as 1/12 of the

    mass of 12C and denoted by u or Da, for Dalton. Hence the atomic mass of 12C

    is 12 u by definition. The atomic masses of the isotopes of all the other elements

    are determined as ratios against this standard, leading to non-integral values for

    essentially all of them.

    The subtle differences of masses are due to the mass defect (essentially, the bind-

    ing energy of the nucleus). We will return to this topic later. For understanding the

    next few slides, the difference between nominal and exact masses is not essential.


    Isotopes (4)

    The average atomic mass (also called the average atomic weight or just atomic weight) of an element is defined as the weighted average of the masses of all its naturally occurring stable isotopes.

    For example, the average atomic mass of carbon is calculated as

    \[ \frac{98.9\% \cdot 12.0 + 1.1\% \cdot 13.003355}{100\%} \doteq 12.011 \]

    For most purposes such as weighing out bulk chemicals only the average molecular mass is relevant, since what one is weighing is a statistical distribution of varying isotopic compositions.

    The monoisotopic mass is the sum of the masses of the atoms in a molecule using the principal isotope mass of each atom instead of the isotope-averaged atomic mass; it is the mass most often used in mass spectrometry. The monoisotopic mass of carbon is 12.
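
    The two definitions above can be contrasted in a short sketch. It uses the isotope table from the earlier slide; the helper names and the glycine formula (C2H5NO2) are illustrative choices, not part of the lecture, and the abundances are renormalised because the tabulated percentages are rounded.

```python
# Isotope table from the slides: element -> list of (mass in Da, abundance in %).
ISOTOPES = {
    "H": [(1.007825, 99.985), (2.014102, 0.015)],
    "C": [(12.0, 98.90), (13.003355, 1.10)],
    "N": [(14.003074, 99.63), (15.000109, 0.37)],
    "O": [(15.994915, 99.76), (16.999131, 0.038), (17.999159, 0.20)],
    "S": [(31.972072, 95.02), (32.971459, 0.75), (33.967868, 4.21)],
}


def average_atomic_mass(element):
    """Abundance-weighted mean of the isotope masses of one element."""
    isotopes = ISOTOPES[element]
    total = sum(abundance for _, abundance in isotopes)  # rounded percentages
    return sum(mass * abundance for mass, abundance in isotopes) / total


def principal_isotope_mass(element):
    """Mass of the most abundant isotope of one element."""
    return max(ISOTOPES[element], key=lambda iso: iso[1])[0]


def molecule_masses(formula):
    """formula: dict element -> atom count. Returns (monoisotopic, average)."""
    mono = sum(n * principal_isotope_mass(e) for e, n in formula.items())
    avg = sum(n * average_atomic_mass(e) for e, n in formula.items())
    return mono, avg


glycine = {"C": 2, "H": 5, "N": 1, "O": 2}  # illustrative example molecule
mono, avg = molecule_masses(glycine)
print(round(average_atomic_mass("C"), 3), round(mono, 4), round(avg, 3))
```

    For every element in this table the principal isotope is also the lightest one, so the average mass always comes out above the monoisotopic mass.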


    Isotopes (5)

    According to the [GoldBook], the principal ion in mass spectrometry is a molecular or fragment ion which is made up of the most abundant isotopes of each of its atomic constituents.

    Sometimes compounds are used that have been artificially isotopically enriched in one or more positions, for example $\mathrm{CH_3{}^{13}CH_3}$ or $\mathrm{CH_2D_2}$. In these cases the principal ion may be defined by treating the heavy isotopes as new atomic species. Thus, in the above two examples, the principal ions would have masses 31 (not 30) and 18 (not 16), respectively.

    In the same vein, the monoisotopic mass spectrum is defined as a spectrum containing only ions made up of the principal isotopes of atoms making up the original molecule.

    You will see that the monoisotopic mass is sometimes defined using the lightest isotope. In most cases the distinction between principal and lightest isotope is non-existent, but there is a difference for some elements, for example iron and argon.


    Isotopic distributions

    The mass spectral peak representing the monoisotopic mass is not always the most abundant isotopic peak in a spectrum, although it stems from the most abundant isotope of each atom type.

    This is because, as the number of atoms in a molecule increases, the probability that the entire molecule contains at least one heavy isotope increases. For example, if there are 100 carbon atoms in a molecule, each of which has an approximately 1% chance of being a heavy isotope, then the whole molecule is not unlikely to contain at least one heavy isotope.

    The monoisotopic peak is sometimes not observable, for two primary reasons:

    - The monoisotopic peak may not be resolved from the other isotopic peaks. In this case only the average molecular mass may be observed.

    - Even if the isotopic peaks are resolved, the monoisotopic peak may be below the noise level and heavy isotopomers may dominate completely.
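
    The 100-carbon example can be made concrete with a small calculation. This is a sketch that assumes the atoms are heavy independently of each other and uses the 13C abundance of 1.1% from the isotope table; the function name is illustrative.

```python
def prob_at_least_one_heavy(n_atoms, p_heavy):
    """Probability that at least one of n independent atoms is a heavy isotope."""
    return 1.0 - (1.0 - p_heavy) ** n_atoms


P_13C = 0.011  # natural abundance of 13C from the isotope table

for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one_heavy(n, P_13C), 3))
```

    Under these assumptions, roughly two thirds of all molecules with 100 carbon atoms carry at least one 13C.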


    Isotopic distributions (2)

    Terminology


    Isotopic distributions (3)

    Example:


    Isotopic distributions (4)

    To summarize: Learn to distinguish the following concepts!

    nominal mass

    monoisotopic mass

    most abundant mass

    average mass


    Isotopic distributions (5)

    Example: Isotopic distribution of human myoglobin

    Screen shot: http://education.expasy.org/student_projects/isotopident/

    Isotopic distributions (6)

    A basic computational task is:

    Given an ion whose atomic composition is known, how can we compute its

    isotopic distribution?

    We will ignore the mass defects for a moment. It is convenient to number the peaks by their number of additional mass units, starting with zero for the lowest isotopic peak. We can call this the isotopic rank.

    Let $E$ be a chemical element. Let $\pi_E[i]$ denote the probability of the isotope of $E$ having $i$ additional mass units. Thus the relative intensities of the isotopic peaks for a single atom of element $E$ are $(\pi_E[0], \pi_E[1], \pi_E[2], \ldots, \pi_E[k_E])$. Here $k_E$ denotes the isotopic rank of the heaviest isotope occurring in nature. We have $\pi_E[\ell] = 0$ for $\ell > k_E$.

    For example, carbon has $\pi_C[0] = 98.9\% = 0.989$ (isotope 12C) and $\pi_C[1] = 1.1\% = 0.011$ (isotope 13C).


    Isotopic distributions (7)

    The probability that a molecule composed of one atom of element $E$ and one atom of element $E'$ has a total of $n$ additional neutrons is

    \[ \pi_{EE'}[n] = \sum_{i=0}^{n} \pi_E[i] \, \pi_{E'}[n-i] . \]

    Note that $\pi_{EE'}[\ell] = 0$ for $\ell > k_E + k_{E'}$.

    This type of composition is very common in mathematics and known as a convolution operation, denoted by the operator $\ast$. Using the convolution operator, we can rewrite the above equation as

    \[ \pi_{EE'} = \pi_E \ast \pi_{E'} . \]

    For example, a hypothetical molecule composed of one carbon and one nitrogen would have $\pi_{CN} = \pi_C \ast \pi_N$,

    \[ \begin{aligned}
    \pi_{CN}[0] &= \pi_C[0]\,\pi_N[0], \\
    \pi_{CN}[1] &= \pi_C[0]\,\pi_N[1] + \pi_C[1]\,\pi_N[0], \\
    \pi_{CN}[2] &= \pi_C[0]\,\pi_N[2] + \pi_C[1]\,\pi_N[1] + \pi_C[2]\,\pi_N[0] = \pi_C[1]\,\pi_N[1] .
    \end{aligned} \]
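
    The carbon and nitrogen example can be reproduced with a direct implementation of the convolution sum. This is a minimal sketch; the list representation (index = isotopic rank) and the function name are assumptions made for illustration, and the abundances are taken from the isotope table.

```python
def convolve(p, q):
    """Convolution of two isotopic distributions, given as lists indexed by isotopic rank."""
    result = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            result[i + j] += p_i * q_j
    return result


PI_C = [0.989, 0.011]    # 12C, 13C
PI_N = [0.9963, 0.0037]  # 14N, 15N

pi_cn = convolve(PI_C, PI_N)
print([round(x, 6) for x in pi_cn])
```

    The result has three entries (ranks 0 to 2), and the last one is just the product of the two heavy-isotope abundances, as in the formula for rank 2 above.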


    Isotopic distributions (8)

    Clearly the same type of formula applies if we compose a larger molecule out of smaller molecules or single atoms. Molecules have isotopic distributions just like elements.

    For simplicity, let us define convolution powers. Let $\pi^1 := \pi$ and $\pi^n := \pi^{n-1} \ast \pi$, for any isotopic distribution $\pi$. Moreover, it is natural to define $\pi^0$ by $\pi^0[0] = 1$, $\pi^0[\ell] = 0$ for $\ell > 0$. This way, $\pi^0$ will act as neutral element with respect to the convolution operator $\ast$, as expected.

    Then the isotopic distribution of a molecule with the chemical formula $(E_1)_{n_1} \cdots (E_\ell)_{n_\ell}$, composed of the elements $E_1, \ldots, E_\ell$, can be calculated as

    \[ \pi_{(E_1)_{n_1} \cdots (E_\ell)_{n_\ell}} = \pi_{E_1}^{n_1} \ast \cdots \ast \pi_{E_\ell}^{n_\ell} . \]


    Isotopic distributions (9)

    This immediately leads to an algorithm for computing the isotopic distribution of a molecule.

    Now let us estimate the running time for computing $\pi_{E_1}^{n_1} \ast \cdots \ast \pi_{E_\ell}^{n_\ell}$.

    The number of convolution operations is $n_1 + \cdots + n_\ell - 1$, which is linear in the number of atoms.

    Each convolution operation involves a summation for each $\pi[i]$. If the highest isotopic rank for $E$ is $k_E$, then the highest isotopic rank for $E_n$ is $k_E n$. Again, this is linear in the number of atoms.

    We can improve on both of these factors.

    Do you see how?


    Isotopic distributions (10)

    Bounding the range of isotopes

    For practical cases, it is possible to restrict the summation range in the convolution operation.

    In principle it is possible to form a molecule solely out of the heaviest isotopes, and this determines the summation range needed in the convolution calculations. However, the abundance of such an isotopomer is vanishingly small. In fact, it will soon fall below the inverse of the Avogadro number ($6.0221415 \times 10^{23}$), so we will hardly ever see a single molecule of this type.

    For simplicity, let us first consider a single element $E$ and assume that $k_E = 1$. (For example, $E$ could be carbon or nitrogen.) In this case the isotopic distribution is a binomial with parameter $p := \pi_E[1]$,

    \[ \pi_{E_n}[k] = \binom{n}{k} p^k (1-p)^{n-k} . \]
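
    As a numerical illustration of this binomial formula, the following sketch evaluates it for 100 carbon atoms with p = 0.011 (the 13C abundance from the isotope table). The function name and the cut-off at rank 10 are arbitrary choices for illustration.

```python
from math import comb


def binomial_isotope_distribution(n, p):
    """Isotopic distribution of n atoms of an element with k_E = 1 and heavy abundance p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


dist = binomial_isotope_distribution(100, 0.011)
most_abundant_rank = max(range(len(dist)), key=dist.__getitem__)
tail_beyond_10 = sum(dist[11:])  # total abundance of isotopic ranks 11..100

print(most_abundant_rank, [round(x, 4) for x in dist[:5]], tail_beyond_10)
```

    For this molecule the peak at rank 1 is already higher than the monoisotopic peak, while all ranks above 10 together carry a negligible share of the abundance, which is why truncating the summation range loses very little.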

    The mean of this binomial distribution is $pn$. Large deviations can be bounded as follows.


    Isotopic distributions (12)

    More generally, a peptide is composed of the elements C, H, N, O, S. For each of these elements the lightest isotope has a natural abundance above 95% and the highest isotopic rank is at most 2. Again we can bound the sum of the abundances of heavy isotopic variants by a binomial distribution:

    \[ \sum_{j \ge i} \pi_{(E_1)_{n_1} \cdots (E_\ell)_{n_\ell}}[j] \;\le\; \sum_{j \ge i/2} \binom{n}{j} \, 0.05^{j} \, 0.95^{n-j} , \]

    where $n = n_1 + \cdots + n_\ell$ is the total number of atoms. (In order to get $i$ additional mass units, at least $i/2$ of the atoms must be heavy.)
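
    The right-hand side of this bound is easy to evaluate. The sketch below does so for a hypothetical molecule of 100 atoms; the function name and the chosen values of i are illustrative only.

```python
from math import ceil, comb


def heavy_tail_bound(n_atoms, i, p_heavy=0.05):
    """Binomial upper bound on the total abundance of isotopic ranks >= i."""
    j_min = ceil(i / 2)  # at least i/2 atoms must be heavy
    return sum(
        comb(n_atoms, j) * p_heavy**j * (1 - p_heavy) ** (n_atoms - j)
        for j in range(j_min, n_atoms + 1)
    )


for i in (10, 20, 40):
    print(i, heavy_tail_bound(100, i))
```

    The bound drops quickly as i grows, so a summation range far below the theoretical maximum rank of 2n is sufficient in practice.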


    Isotopic distributions (13)

    Computing convolution powers by iterated squaring

    There is another trick which can be used to reduce the number of convolutions needed to calculate the $n$-th convolution power $\pi^n$ of an elemental isotope distribution.

    Observe that, just as for any associative operation $\ast$, the convolution powers satisfy $\pi^{2n} = \pi^n \ast \pi^n$.

    In general, $n$ is not a power of two, so let $(b_j, b_{j-1}, \ldots, b_0)$ be the bits of the binary representation of $n$, that is,

    \[ n = \sum_{\ell=0}^{j} b_\ell 2^\ell = b_j 2^j + b_{j-1} 2^{j-1} + \cdots + b_0 2^0 . \]

    Then we can compute $\pi^n$ as follows:

    \[ \pi^n = \pi^{\sum_{\ell \le j} 2^\ell b_\ell} = \prod_{\ell \le j} \pi^{2^\ell b_\ell} = \pi^{2^j b_j} \ast \pi^{2^{j-1} b_{j-1}} \ast \cdots \ast \pi^{2^0 b_0} , \]

    where the $\prod$ is of course meant with respect to $\ast$. The total number of convolutions needed for this calculation is only $O(\log n)$.
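
    Iterated squaring and the restriction of the summation range can be combined in a few lines. The following is a sketch under stated assumptions, not the lecture's reference implementation: distributions are lists indexed by isotopic rank, `max_rank` is a hypothetical truncation parameter, and the element table contains only the two elements needed for the example.

```python
def convolve(p, q, max_rank=None):
    """Convolution of two distributions; optionally keep only ranks 0..max_rank."""
    result = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            result[i + j] += p_i * q_j
    return result if max_rank is None else result[: max_rank + 1]


def convolution_power(pi, n, max_rank=None):
    """n-th convolution power of pi, using O(log n) convolutions."""
    result = [1.0]     # pi^0, the neutral element
    square = list(pi)  # holds pi^(2^l) while bit l of n is processed
    while n > 0:
        if n & 1:      # bit b_l is set: multiply pi^(2^l) into the result
            result = convolve(result, square, max_rank)
        n >>= 1
        if n:
            square = convolve(square, square, max_rank)
    return result


def isotopic_distribution(formula, elements, max_rank=None):
    """formula: dict element -> count; elements: dict element -> distribution."""
    dist = [1.0]
    for symbol, count in formula.items():
        power = convolution_power(elements[symbol], count, max_rank)
        dist = convolve(dist, power, max_rank)
    return dist


ELEMENTS = {"C": [0.989, 0.011], "N": [0.9963, 0.0037]}
print([round(x, 4) for x in isotopic_distribution({"C": 100, "N": 20}, ELEMENTS, max_rank=5)])
```

    Truncation does not change the retained entries: the value at rank r depends only on input entries of rank at most r, so cutting both operands at `max_rank` leaves ranks 0 to `max_rank` exact.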


    Mass decomposition

    A related question is:

    Given a peak mass, what can we say about the elemental composition of the

    ion that generated it?

    In most cases, one cannot obtain much information about the chemical structure from just a single peak. The best we can hope for is to obtain the chemical formula with isotopic information attached to it. In this sense, the total mass of an ion is decomposed into the masses of its constituents, hence the term mass decomposition.


    Mass decomposition (2)

    This is formalized by the concept of a compomer [BL05].

    We are given an alphabet $\Sigma$ of size $|\Sigma| = k$, where each letter has a mass $a_i$, $i = 1, \ldots, k$. These letters can represent atom types, isotopes, amino acids, or nucleotides. We assume that all masses are different, because otherwise we could never distinguish them anyway. Thus we can identify each letter with its mass, i.e., $\Sigma = \{a_1, \ldots, a_k\} \subset \mathbb{N}$. This is sometimes called a weighted alphabet.

    The mass of a string $s = s_1 \cdots s_n \in \Sigma^*$ is defined as the sum of the masses of its letters, i.e., $\mathrm{mass}(s) = \sum_{i=1}^{|s|} s_i$.

    Formally, a compomer is an integer vector $c = (c_1, \ldots, c_k) \in (\mathbb{N}_0)^k$. Each $c_i$ represents the number of occurrences of letter $a_i$. The mass of a compomer is $\mathrm{mass}(c) := \sum_{i=1}^{k} c_i a_i$, as opposed to its length, $|c| := \sum_{i=1}^{k} c_i$.

    In short: A compomer tells us how many instances of an atomic species are present in a molecule. We want to find all compomers whose mass is equal to the observed mass.


    Mass decomposition (3)

    There are many more strings (molecules) than compomers, but the compomers are still many.

    For a string $s = s_1 \cdots s_n \in \Sigma^*$, we define $\mathrm{comp}(s) := (c_1, \ldots, c_k)$, where $c_i := \#\{ j \mid s_j = a_i \}$. Then $\mathrm{comp}(s)$ is the compomer associated with $s$, and vice versa.

    One can prove (exercise):

    1. The number of strings associated with a compomer $c = (c_1, \ldots, c_k)$ is $\binom{|c|}{c_1, \ldots, c_k} = \frac{|c|!}{c_1! \cdots c_k!}$.

    2. Given an integer $n$, the number of compomers $c = (c_1, \ldots, c_k)$ with $|c| = n$ is $\binom{n+k-1}{k-1}$.

    Thus a simple enumeration will not suffice for larger instances.
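
    Both counting formulas are easy to evaluate and to check against brute-force enumeration on a tiny alphabet. This is an illustrative sketch; the function names and the example sizes are not from the lecture.

```python
from itertools import product
from math import comb, factorial, prod


def strings_per_compomer(c):
    """Multinomial coefficient |c|! / (c_1! ... c_k!)."""
    return factorial(sum(c)) // prod(factorial(c_i) for c_i in c)


def compomers_of_length(n, k):
    """Number of compomers c in (N_0)^k with |c| = n."""
    return comb(n + k - 1, k - 1)


def brute_force_compomers(n, k):
    """Enumerate all strings of length n over k letters and collect their compomers."""
    return {
        tuple(s.count(letter) for letter in range(k))
        for s in product(range(k), repeat=n)
    }


print(strings_per_compomer((2, 1, 0)))
print(compomers_of_length(4, 3), len(brute_force_compomers(4, 3)))
print(compomers_of_length(10, 20))  # e.g. strings of length 10 over 20 letters
```

    The brute-force variant visits all k^n strings, which is exactly the kind of enumeration the slide warns against for larger instances.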


    Mass decomposition (4)

    Using dynamic programming, we can solve the following problems efficiently ($\Sigma$: weighted alphabet, $M$: mass):

    1. Existence problem: Decide whether a compomer $c$ with $\mathrm{mass}(c) = M$ exists.

    2. One Witness problem: Output a compomer $c$ with $\mathrm{mass}(c) = M$, if one exists.

    3. All Witnesses problem: Compute all compomers $c$ with $\mathrm{mass}(c) = M$.


    Mass decomposition (5)

    The dynamic programming algorithm is a variation of the classical algorithm for the Coin Change Problem, originally due to Gilmore and Gomory.

    Given a query mass $M$, a two-dimensional Boolean table $B$ of size $kM$ is constructed such that

    \[ B[i, m] = 1 \iff m \text{ is decomposable over } \{a_1, \ldots, a_i\} . \]

    The table can be computed with the following recursion:

    \[ B[1, m] = 1 \iff m \bmod a_1 = 0 , \]

    and for $i > 1$,

    \[ B[i, m] = \begin{cases} B[i-1, m] & m < a_i , \\ B[i-1, m] \lor B[i, m - a_i] & \text{otherwise} . \end{cases} \]

    The table is constructed up to mass $M$, and then a straightforward backtracking algorithm computes all witnesses of $M$.
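
    The table construction and the backtracking step can be sketched as follows. This is one possible reading of the recursion above, with 0-based indices and a toy integer alphabet {2, 3, 5} chosen for illustration; it is not the implementation from [BL05].

```python
def build_table(masses, M):
    """B[i][m] is True iff m is decomposable over masses[0..i] (0-based i)."""
    k = len(masses)
    B = [[False] * (M + 1) for _ in range(k)]
    for m in range(M + 1):
        B[0][m] = m % masses[0] == 0
    for i in range(1, k):
        a = masses[i]
        for m in range(M + 1):
            B[i][m] = B[i - 1][m] or (m >= a and B[i][m - a])
    return B


def all_witnesses(masses, M):
    """All compomers (tuples of counts) whose mass equals M."""
    B = build_table(masses, M)
    k = len(masses)
    results = []
    c = [0] * k

    def backtrack(i, m):
        if i == 0:
            # B guarantees that m is a multiple of masses[0] here
            c[0] = m // masses[0]
            results.append(tuple(c))
            return
        count, rest = 0, m
        while rest >= 0:  # try 0, 1, 2, ... copies of letter i
            if B[i - 1][rest]:  # only descend where a witness exists
                c[i] = count
                backtrack(i - 1, rest)
            rest -= masses[i]
            count += 1
        c[i] = 0

    if B[k - 1][M]:
        backtrack(k - 1, M)
    return results


print(sorted(all_witnesses([2, 3, 5], 10)))
```

    Because the table is consulted before every recursive call, the backtracking never enters a branch that cannot be completed to a witness.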


    Mass decomposition (6)

    For the Existence and One Witness Problems, it suffices to construct a one-dimensional Boolean table $A$ of size $M$, using the recursion $A[0] = 1$, $A[m] = 0$ for $1 \le m < a_1$; and for $m \ge a_1$, $A[m] = 1$ if there exists an $i$ with $1 \le i \le k$ such that $A[m - a_i] = 1$, and $0$ otherwise. The construction time is $O(kM)$ and one witness $c$ can be produced by backtracking in time proportional to $|c|$, which can be in the worst case $\frac{1}{a_1} M$. Of course, both of these problems can also be solved using the full table $B$.

    A variant computes $\gamma(M)$, the number of decompositions of $M$, in the last row, where the entries are integers, using the recursion $C[i, m] = C[i-1, m] + C[i, m - a_i]$.

    The running time for solving the All Witnesses Problem is $O(kM)$ for the table construction, and $O(\gamma(M) \frac{1}{a_1} M)$ for the computation of the witnesses (where $\gamma(M)$ is the size of the output set), while storage space is $O(kM)$.
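
    The one-dimensional table and the counting variant fit in a short sketch. The toy alphabet {2, 3, 5} and the function names are illustrative assumptions. The backtracking here rescans the alphabet at every step, so it does O(k) work per letter rather than the constant work per letter that the slide's bound assumes.

```python
def one_witness(masses, M):
    """Return one compomer (tuple of counts) with mass M, or None if none exists."""
    A = [False] * (M + 1)
    A[0] = True
    for m in range(1, M + 1):
        A[m] = any(m >= a and A[m - a] for a in masses)
    if not A[M]:
        return None
    c = [0] * len(masses)
    m = M
    while m > 0:  # each step removes one letter, so |c| steps in total
        for i, a in enumerate(masses):
            if m >= a and A[m - a]:
                c[i] += 1
                m -= a
                break
    return tuple(c)


def count_decompositions(masses, M):
    """Number of compomers with mass M, using one row that is updated per letter."""
    C = [0] * (M + 1)
    C[0] = 1
    for a in masses:  # after this pass, C[m] counts decompositions over the letters seen so far
        for m in range(a, M + 1):
            C[m] += C[m - a]
    return C[M]


print(one_witness([2, 3, 5], 10), count_decompositions([2, 3, 5], 10))
```

    Updating a single row in place, with m increasing, realizes the recursion C[i, m] = C[i-1, m] + C[i, m - a_i] without storing the full table.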


    Mass decomposition (7)

    The number of compomers is $O(M^{k-1})$. (Exercise: why?) Depending on the mass resolution, the results can be useful for $M$ up to, say, 1000 Da, but in general one has to take further criteria into account. (Figure from [BL05].)


    Mass decomposition (8)

    Example output from http://bibiserv.techfak.uni-bielefeld.de/decomp/

    # imsdecomp 1.3
    # Copyright 2007,2008 Informatics for Mass Spectrometry group
    # at Bielefeld University
    #
    # http://BiBiServ.TechFak.Uni-Bielefeld.DE/decomp/
    #
    # precision: 4e-05
    # allowed error: 0.1 Da
    # mass mode: mono
    # modifiers: none
    # fixed modifications: none
    # variable modifications: none
    # alphabet (character, mass, integer mass):
    # H 1.007825 25196
    # C 12 300000
    # N 14.003074 350077
    # O 15.994915 399873
    # P 30.973761 774344
    # S 31.972071 799302
    # constraints: none
    # chemical plausibility check: off
    #
    # Shown in parentheses after each decomposition:
    # - actual mass
    # - deviation from actual mass


    H11 N8 P1 S2 (218.02857; -0.001429668)
    H7 C15 P1 (218.02854; -0.001463276)
    H135 C3 N1 S1 (218.03152; +0.00152403)
    H134 C2 N2 P1 (218.02846; -0.001536192)
    H18 N3 O2 P2 S2 (218.03157; +0.001566246)
    H12 C1 N7 S3 (218.03163; +0.001630554)
    H18 C2 O5 S3 (218.03164; +0.001635776)
    H7 C4 N6 O3 P1 (218.03172; +0.001724644)
    H14 C5 O5 S2 (218.02826; -0.001735052)
    H8 C4 N7 S2 (218.02826; -0.001740274)
    H14 C3 N3 O2 P2 S1 (218.0282; -0.001804582)
    H131 C6 N1 (218.02815; -0.001846798)
    H11 C12 P1 S1 (218.03191; +0.001907552)
    H10 N7 O3 P2 (218.03204; +0.00203525)
    H16 C1 O8 P2 (218.03204; +0.002040472)
    H4 C1 N11 O1 S1 (218.0321; +0.002099558)
    H10 C2 N4 O6 S1 (218.0321; +0.00210478)
    H11 C7 N2 O2 P1 S1 (218.02788; -0.002115188)
    H14 C8 N1 P2 S1 (218.03222; +0.002218158)
    H13 N1 O10 P1 (218.02771; -0.002292874)
    H8 C11 N1 O2 S1 (218.02757; -0.002425794)
    H22 C3 S5 (218.0325; +0.002504204)
    H17 C4 N2 P3 S1 (218.03253; +0.002528764)
    H10 C4 O10 (218.0274; -0.00260348)
    H4 C3 N7 O5 (218.02739; -0.002608702)
    H6 C10 N2 O4 (218.03276; +0.002756692)
    H20 N3 P4 S1 (218.03284; +0.00283937)
    H20 C2 O3 P2 S2 (218.03291; +0.0029089)
    H14 C3 N4 O1 S3 (218.03297; +0.002973208)
    H9 C6 N3 O4 P1 (218.03307; +0.003067298)
    H12 C3 N3 O4 S2 (218.02692; -0.003077706)
    H12 C1 N6 O1 P2 S1 (218.02685; -0.003147236)
    H18 N2 O3 P4 (218.02679; -0.003211544)
    H12 C2 N4 O4 P2 (218.03338; +0.003377904)
    H6 C3 N8 O2 S1 (218.03344; +0.003442212)
    H12 C4 N1 O7 S1 (218.03345; +0.003447434)
    H9 C5 N5 O1 P1 S1 (218.02654; -0.003457842)
    H15 C4 N1 O3 P3 (218.02648; -0.00352215)
    H20 C1 N2 P2 S3 (218.02638; -0.00361624)
    H15 N2 O7 P1 S1 (218.03376; +0.00375804)
    H6 C9 N4 O1 S1 (218.02623; -0.003768448)
    H12 C8 O3 P2 (218.02617; -0.003832756)
    H17 C5 N1 P1 S3 (218.02607; -0.003926846)
    H8 C2 N3 O9 (218.02605; -0.003946134)
    H2 C1 N10 O4 (218.02605; -0.003951356)
    H2 C11 N6 (218.03409; +0.004094124)
    H22 C2 O1 P4 S1 (218.03418; +0.004182024)
    H14 C9 S3 (218.02576; -0.004237452)
    H16 C5 N1 O2 S3 (218.03432; +0.004315862)
    H5 C7 N7 P1 (218.0344; +0.00440473)
    H11 C8 O5 P1 (218.03441; +0.004409952)
    H10 C1 N6 O3 S2 (218.02558; -0.00442036)
    H16 N2 O5 P2 S1 (218.02552; -0.004484668)
    H133 C3 O3 (218.02547; -0.004526884)
    H19 C1 N2 O2 P1 S3 (218.03463; +0.004626468)
    H8 C3 N8 P2 (218.03472; +0.004715336)
    H14 C4 N1 O5 P2 (218.03472; +0.004720558)
    H8 C5 N5 O3 S1 (218.03478; +0.004784866)
    H13 C4 N1 O5 P1 S1 (218.0252; -0.004795274)
    H7 C3 N8 P1 S1 (218.0252; -0.004800496)
    H13 C2 N4 O2 P3 (218.02514; -0.004864804)
    H18 C1 N2 O2 S4 (218.02511; -0.004889364)
    H139 N1 S2 (218.03489; +0.004894858)
    H17 N2 O5 P3 (218.03503; +0.005031164)
    H11 C1 N6 O3 P1 S1 (218.0351; +0.005095472)
    H10 C8 O5 S1 (218.02489; -0.00510588)
    H4 C7 N7 S1 (218.02489; -0.005111102)
    H10 C6 N3 O2 P2 (218.02482; -0.00517541)
    H15 C9 P1 S2 (218.03528; +0.00527838)
    H6 N6 O8 (218.02471; -0.005288788)
    H21 C2 O1 P3 S2 (218.02467; -0.005333808)
    H131 N5 O1 (218.03536; +0.005363862)

    # done


    Mass defect

    The difference between the actual atomic mass of an isotope and the nearest integral mass is called the mass defect. The size of the mass defect varies over the Periodic Table. The mass defect is due to the binding energy of the nucleus:

    http://de.wikipedia.org/w/index.php?title=Datei:Bindungsenergie_massenzahl.jpg


    Mass defect (2)

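    The mass differences of light and heavy isotopes are also not exactly multiples of the atomic mass unit. We have

    \[ \begin{aligned}
    \mathrm{mass}({}^{2}\mathrm{H}) - \mathrm{mass}({}^{1}\mathrm{H}) &= 1.00628 \doteq 1 \\
    \mathrm{mass}({}^{13}\mathrm{C}) - \mathrm{mass}({}^{12}\mathrm{C}) &= 1.003355 \doteq 1 \\
    \mathrm{mass}({}^{18}\mathrm{O}) - \mathrm{mass}({}^{16}\mathrm{O}) &= 2.004244 \doteq 2 \\
    \mathrm{mass}({}^{15}\mathrm{N}) - \mathrm{mass}({}^{14}\mathrm{N}) &= 0.997035 \doteq 1 \\
    \mathrm{mass}({}^{34}\mathrm{S}) - \mathrm{mass}({}^{32}\mathrm{S}) &= 1.995796 \doteq 2
    \end{aligned} \]

    These differences (due to the mass defect) are subtle but become perceptible with very high resolution mass spectrometers. (Exercise: What resolution is necessary, approximately?) This is currently an active field of research.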


    Mass defect (3)

    High resolution and low resolution isotopic spectra for C15N20S4O8.

    hires.dat               lores.dat
    715.909120  100.0       716  100.0
    716.906159    7.2       717   26.8
    716.908509    3.2
    716.912479   16.1
    716.913329    0.2
    717.903189    0.2       718   22.6
    717.904919   17.6
    717.905549    0.2
    717.909520    1.1
    717.911870    0.5
    717.913360    1.6
    717.915830    1.1
    718.901960    1.3       719    5.0
    718.904310    0.4
    718.908279    2.8
    718.910399    0.1
    718.916720    0.2
    719.900719    1.1       720    1.8
    719.905319    0.2
    719.909160    0.2
    719.911629    0.2
    720.904080    0.2       721    0.2

    Isotopic pattern:

    Zoom on +2 mass peak (718):


    Signal Processing

    This exposition is based on:

    Steven W. Smith: The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, San Diego, California, second edition 1999. Available at http://www.dspguide.com/ [S99]

    Wikipedia articles: http://en.wikipedia.org/wiki/Precision_and_accuracy, http://de.wikipedia.org/wiki/Pr%C3%A4zision (Präzision)

    Joachim Weickert: Image Processing and Computer Vision, lecture 11, 2004. [W04]

    Christian Huber: Bioanalytik, lecture course at Saarland University (Universität des Saarlandes), 2005. [H05]


    Outline

    This lecture is an introduction to some of the signal processing aspects involved in the analysis of mass spectrometry data. Signal processing is a large field. Here we can only give a glimpse of it.

    Precision, accuracy, and resolution

    Morphological filters for baseline reduction

    Linear filters


    Precision, accuracy, and resolution

    Precision and accuracy.

    The two concepts precision and accuracy are often used interchangeably in non-technical settings, but they have very specific definitions in science, engineering, and statistics.


    Precision, accuracy, and resolution (2)

    Accuracy is the degree of conformity of a measured or calculated quantity to its actual (true) value. Precision is the degree to which further measurements or calculations will show the same or similar results. The results of calculations or a measurement can be accurate but not precise; precise but not accurate; neither; or both. A result is called valid if it is both accurate and precise.

    [Figure: Accuracy and precision. Panels: high accuracy but low precision; high precision but low accuracy.]


    Precision, accuracy, and resolution (3)

    Precision is usually characterized in terms of the standard deviation of the measurements. (In German: Präzision.) Precision is sometimes stratified into:

    Repeatability - the variation arising when all efforts are made to keep conditions constant by using the same instrument and operator, and repeating during a short time period.

    In German: innere Genauigkeit einer Messung (formerly also Wiederholgenauigkeit), that is, the stability of the measuring instrument or of its reading during the measurement itself; it is determined by error analysis and adjustment computation after repeating the measurement many times under the same conditions and with the same instrument or measuring system.

    Reproducibility - the variation arising using the same measurement process among different instruments and operators, and over longer time periods.

    In German: äußere Genauigkeit einer Messung, that is, the scatter of the measurements when they are repeated under different external conditions.


    Accuracy is related to the difference (bias) between the mean of the measurements and the reference value. Establishing and correcting for bias is the task of calibration, which corrects for systematic errors of the measurement. Of course this requires that a measurement of higher accuracy, or another source for the true value, is available!

    In German: absolute Genauigkeit einer Messung, that is, the degree of agreement between the indicated and the true value.

    When deciding which name to call the problem, ask yourself two questions:

    First: Will averaging successive readings provide a better measurement? If yes, call the error precision; if no, call it accuracy.

    Second: Will calibration correct the error? If yes, call it accuracy; if no, call it precision.


    Precision, accuracy, and resolution (4)

    Resolution is yet another concept that is closely related to precision.


    Precision, accuracy, and resolution (5)

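    The accuracy (in German: Massengenauigkeit) is often reported using a pseudo-unit of ppm (parts per million): Let

    \[ \Delta m \,[\mathrm{u}] = m_{\text{measured}} - m_{\text{theoretical}} . \]

    Then

    \[ \Delta m / m \,[\mathrm{ppm}] = \frac{m_{\text{measured}} - m_{\text{theoretical}}}{m_{\text{theoretical}}} \cdot 10^{6} \,[\mathrm{ppm}] . \]

    Like accuracy, the resolution is a dimensionless number.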
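
    The ppm formula translates directly into code. In this sketch the function name and the two mass values are made up for illustration.

```python
def mass_error_ppm(m_measured, m_theoretical):
    """Relative mass deviation in parts per million."""
    return (m_measured - m_theoretical) / m_theoretical * 1e6


# Hypothetical example: a deviation of 0.005 u at a mass of 1000 u.
print(round(mass_error_ppm(1000.005, 1000.000), 3))
```

    The same absolute deviation corresponds to a larger ppm value at lower masses, which is why mass accuracy is usually quoted relative to the mass.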

    Morphological filters

    Mathematical morphology is a relatively new branch of mathematics. It was founded around 1965 at the École Nationale Supérieure des Mines in Fontainebleau near Paris (according to [W04]).

    Morphological methods take into account only the level sets

    \[ L_i(f) := \{\, (x, y) \mid f(x, y) \ge i \,\} \]

    of an image $f$.

    Hence the results are invariant under all strictly monotonically increasing transformations of the intensity values. Morphological methods are well-suited for analyzing the shape of objects; this is in fact what motivates the name morphology. The concept of level sets is related to the rank transformation used in statistics.

    Morphological filters provide a nonlinear alternative to the linear filters which we will

    learn about later.

    Morphological filters (2)

    Why is filtering, and signal processing in general, so important for computational mass spectrometry? A typical mass spectrum can be decomposed into three additive terms with different frequency ranges:

    Information: The real signal we are interested in, e.g. an isotopic pattern caused by a peptide. Medium frequency.

    Baseline: A broad trend, for example caused by signals from matrix ions when MALDI is used. Very low frequency; should not change much within 5 Th.

    Noise: Very high frequency, e.g. detector noise; (hopefully) not even correlated among consecutive sample points in the raw data.

    Morphological filters (4)

    Erosion.

    Intuitively, the erosion is obtained by moving the structuring element (in German:

    Strukturelement) within the area under the signal and marking the area covered

    by the reference point.

    Morphological filters (5)

    Dilation.

    The dilation is defined similarly. This time we move the reference point within the

    area under the signal and mark the area covered by the structuring element.

    In German this is called Dilatation, but Dilation also seems to be in use.

    Morphological filters (6)

    The mathematical definition can be given in very general terms, but in our case, for the top hat filter, we will only consider flat structuring elements. Thus the structuring element $B$ is a symmetric interval around zero, and zero is the reference point. Therefore the definitions are as simple as:

    Dilation: $(f \oplus B)(x) := \max\{ f(x - x') \mid x' \in B \}$.

    Erosion: $(f \ominus B)(x) := \min\{ f(x + x') \mid x' \in B \}$.

    Since for any set $X$, $\min\{ x \mid x \in X \} = -\max\{ -x \mid x \in X \}$, it is sufficient to explain the algorithm for dilation (the max case).
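
    The two definitions can be sketched directly in Python for a sampled signal. This is a minimal illustration, assuming that the structuring element is the symmetric index interval {-r, ..., r} and that the window is simply clipped at the ends of the signal; the function names and the border handling are choices made here, not part of the lecture.

```python
def dilation(f, r):
    """Dilation with the flat structuring element B = {-r, ..., r}.

    For each position x, take the maximum of f over the window [x - r, x + r],
    clipped to the valid index range.
    """
    n = len(f)
    return [max(f[max(0, x - r):min(n, x + r + 1)]) for x in range(n)]


def erosion(f, r):
    """Erosion expressed through dilation, using min{X} = -max{-X}."""
    negated = [-v for v in f]
    return [-v for v in dilation(negated, r)]


signal = [0, 1, 5, 1, 0, 0, 2]
dilated = dilation(signal, 1)   # peaks become broader
eroded = erosion(signal, 1)     # narrow peaks are removed
```

    Because the interval is symmetric, it does not matter for this sketch whether the window is written with `x - x'` or `x + x'`.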

    Morphological filters (7)

    Trivial algorithm

    Assume that the signal to be processed is $x_0, x_1, \dots, x_{n-1}$, the size of the structuring element is $p > 1$, and we want to compute the max filter: $y_i := \max_{0 \le j \ldots} \ldots$


    (Disclaimer: The algorithms of Van Herk and Gil-Werman have a slightly different definition of segments. So the following images should only be considered exact up to ±1.)
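
    To make the difference between the trivial algorithm and the segment-based idea concrete, here is a hedged Python sketch. It assumes the forward window convention y_i = max(x_i, ..., x_{i+p-1}) for i = 0, ..., n-p, which is an assumption made for this example. The segment layout (blocks of length p starting at index 0, padded with minus infinity) is likewise only one possible choice and may differ from the original papers by the offset mentioned in the disclaimer.

```python
def max_filter_naive(x, p):
    """Trivial algorithm: scan every window of length p, O(n * p) comparisons."""
    return [max(x[i:i + p]) for i in range(len(x) - p + 1)]


def max_filter_segments(x, p):
    """Segment-based max filter in the spirit of van Herk / Gil-Werman.

    The signal is cut into segments of length p. g holds running maxima from the
    left end of each segment, h holds running maxima from the right end. Every
    window of length p overlaps at most two segments, so its maximum is
    max(h[i], g[i + p - 1]).
    """
    n = len(x)
    padded = list(x) + [float("-inf")] * (-n % p)  # pad to a multiple of p
    m = len(padded)
    g = padded[:]
    h = padded[:]
    for i in range(m):
        if i % p != 0:                 # not the first sample of a segment
            g[i] = max(g[i - 1], padded[i])
    for i in range(m - 1, -1, -1):
        if i % p != p - 1:             # not the last sample of a segment
            h[i] = max(h[i + 1], padded[i])
    return [max(h[i], g[i + p - 1]) for i in range(n - p + 1)]


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
```

    In the segment-based variant the number of comparisons per sample does not depend on p, which is the point of the construction. By the min/max duality, the min filter can be obtained by negating the input and the output.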

    Morphological filters (13)

    Separability

    We consider the case where we have two dimensions and compute the dilation.

    The proof for higher dimensions and/or erosion is similar.

    Let f be the signal and B be the structuring element. We can assume that the

    reference point is 0. As we have seen, the value of the dilation of f at a certain

    position (x, y) is just the maximum of the original signal within the region:

    $$(f \oplus B)(x, y) = \max\{ f(x - x', y - y') \mid (x', y') \in B \}.$$

    Now consider the case where $B$ is an axis-parallel (!) rectangle $B = B_x \times B_y$. What would be the result of applying the one-dimensional dilation to both dimensions one after another?
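
    The question can be explored numerically. The sketch below compares a direct two-dimensional dilation over a rectangle with two one-dimensional dilations applied one after another. The function names, the clipping of windows at the image border, and the toy image are assumptions for illustration only.

```python
def dilate_1d(row, r):
    """One-dimensional dilation with the interval {-r, ..., r}, clipped at the border."""
    n = len(row)
    return [max(row[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def dilate_2d_direct(img, rx, ry):
    """Maximum over the full (2*ry + 1) x (2*rx + 1) rectangle around each pixel."""
    height, width = len(img), len(img[0])
    return [[max(img[v][u]
                 for v in range(max(0, y - ry), min(height, y + ry + 1))
                 for u in range(max(0, x - rx), min(width, x + rx + 1)))
             for x in range(width)]
            for y in range(height)]


def dilate_2d_separable(img, rx, ry):
    """Dilate every row along x, then every column of the result along y."""
    rows = [dilate_1d(row, rx) for row in img]
    cols = [dilate_1d(list(col), ry) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]   # transpose back to row-major order


image = [[0, 1, 0, 0],
         [0, 0, 0, 7],
         [2, 0, 0, 0]]
```

    Both functions return the same image, since the maximum over a rectangle equals the maximum over y of the maxima over x. The separable version needs only one-dimensional passes, so any fast one-dimensional max filter can be reused.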

    Morphological filters (19)

    More nice mathematical properties:

    The opening and closing satisfy the following inequalities:

    $f \circ B \le f$ (opening $\le$ signal)

    $f \bullet B \ge f$ (closing $\ge$ signal)

    Multiple openings or closings with the same structuring element do not alter the signal any more; we have

    $(f \circ B) \circ B = f \circ B$ (opening twice is the same as opening)

    $(f \bullet B) \bullet B = f \bullet B$ (closing twice is the same as closing)

    (Proof: Exercise.)
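
    Before attempting the proof, the properties can be checked on an example. The sketch assumes the usual definitions (opening = erosion followed by dilation, closing = dilation followed by erosion), a symmetric window {-r, ..., r} clipped at the signal border, and an arbitrary test signal; all names are illustrative.

```python
def dilation(f, r):
    n = len(f)
    return [max(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def erosion(f, r):
    n = len(f)
    return [min(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def opening(f, r):
    """Erosion followed by dilation with the same structuring element."""
    return dilation(erosion(f, r), r)


def closing(f, r):
    """Dilation followed by erosion with the same structuring element."""
    return erosion(dilation(f, r), r)


f = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5]
r = 2
opened = opening(f, r)
closed = closing(f, r)

# opening <= signal <= closing, sample by sample
sandwiched = all(o <= v <= c for o, v, c in zip(opened, f, closed))
# applying the same opening or closing again changes nothing
idempotent = opening(opened, r) == opened and closing(closed, r) == closed
```

    A passing check on one signal is of course not a proof; it only illustrates what the statements mean.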

    Morphological filters (20)

    To show that the opening satisfies the inequality $f \circ B = (f \ominus B) \oplus B \le f$, let us assume that the contrary holds for some argument $x$. Let $y := ((f \ominus B) \oplus B)(x) > f(x)$. Then by definition of $\oplus$ there exists an $x' \in B$ such that $(f \ominus B)(x - x') = y$. But the definition of $\ominus$ implies that for all $x'' \in B$ it holds $f((x - x') + x'') \ge (f \ominus B)(x - x')$. If we set $x'' = x'$, then $f(x) \ge (f \ominus B)(x - x') = y > f(x)$, a contradiction. In the same way, one can prove that the closing satisfies the inequality $f \bullet B \ge f$.

    To show that the opening satisfies $(f \circ B) \circ B = f \circ B$, we prove that in fact a slightly stronger assertion holds: If $g = h \oplus B$ is the result of a dilation by $B$, then $g \circ B = g$. (In our case, $h = f \ominus B$.) By the preceding inequality, it suffices to show that $g \circ B \ge g$. Let $x$ be arbitrary, then the claim is that $((g \ominus B) \oplus B)(x) \ge g(x)$. By the definition of $\oplus$, there is an $x' \in B$ such that $g(x) = h(x - x')$. But then by the definition of $\oplus$ we also have $(h \oplus B)((x - x') + x'') \ge h(x - x')$ for all $x'' \in B$. Hence by the definition of $\ominus$, we have $((h \oplus B) \ominus B)(x - x') \ge h(x - x')$. Again by the definition of $\oplus$, we have $(((h \oplus B) \ominus B) \oplus B)((x - x') + x'') \ge h(x - x') = g(x)$ for all $x'' \in B$. Now let $x'' = x'$. It follows that $(((h \oplus B) \ominus B) \oplus B)(x) \ge g(x)$, as claimed. In the same way, one can prove that the closing satisfies $(f \bullet B) \bullet B = f \bullet B$.

    Morphological filters (22)

    input: red, erosion: green, opening: blue, tophat: yellow

    (From the documentation of class OpenMS::MorphologicalFilter.) http://www.openms.de
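
    The curves named in the legend can be reproduced on a toy signal. The following Python sketch assumes the usual definition of the top hat as the signal minus its opening; it is an independent illustration and not the OpenMS implementation. The window radius, the linear baseline and the single narrow peak are hypothetical.

```python
def erosion(f, r):
    n = len(f)
    return [min(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def dilation(f, r):
    n = len(f)
    return [max(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def top_hat(f, r):
    """Signal minus its opening: keeps structures narrower than the window."""
    opened = dilation(erosion(f, r), r)
    return [v - o for v, o in zip(f, opened)]


# Hypothetical toy spectrum: slowly rising baseline plus one narrow peak.
baseline = [0.1 * i for i in range(30)]
spectrum = baseline[:]
spectrum[15] += 5.0

filtered = top_hat(spectrum, r=3)
peak_position = filtered.index(max(filtered))
```

    The structuring element has to be wider than the peaks of interest but narrow compared with the scale on which the baseline changes. With the clipped window used here, a few samples at the right border keep a small positive offset.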

    Signal processing basics

    A signal is a description of how one parameter varies with another parameter.

    Signals can be continuous or discrete.

    A system is any process that takes an input signal and produces an output

    signal.

    Depending on the type of input and output, one can distinguish continuous

    systems, such as analog electronics, and discrete systems, such as computer

    programs that manipulate the values stored in arrays.

    Converting a continuous signal into a discrete signal introduces sampling and quantization errors, due to the discretization of the abscissa (x-axis) and ordinate (y-axis), respectively.

    Signal processing basics (3)

    From a theoretical point of view, sampling is necessary since we cannot store (or

    process in digital form) an infinite amount of data.

    The intensity values are often reported as ion counts, but this does not mean that we do not have a quantization step here! Firstly, we should keep in mind that these counts are the result of internal calculations performed within the instrument (e.g., the accumulation of micro-scans). Secondly, a mass spectrometer is simply not designed to count single ions but rather to measure the concentration of substances in a sample.

    Signal processing basics (5)

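    Let $X(i)$ denote the quantization error of the $i$-th sample. We consider $X$ as a random variable defined on the sample numbers. It is uniformly distributed on the interval $[-\tfrac{1}{2}\Delta, +\tfrac{1}{2}\Delta]$, where $\Delta := \mathrm{LSB}$ (the least significant bit, i.e. the quantization step). The probability density function of $X$ is

    $$f(x) = \begin{cases} 1/\Delta & \text{for } |x| \le \tfrac{1}{2}\Delta, \\ 0 & \text{otherwise.} \end{cases}$$

    By elementary calculus, we have

    $$E(X) = \int_{-\Delta/2}^{\Delta/2} x \, \frac{1}{\Delta} \, dx = \frac{1}{\Delta} \left[ \frac{1}{2} x^2 \right]_{x=-\Delta/2}^{x=\Delta/2} = \frac{1}{\Delta} \cdot \frac{1}{2} \left( \left( \frac{\Delta}{2} \right)^2 - \left( -\frac{\Delta}{2} \right)^2 \right) = 0.$$

    Thus $V(X) = E(X^2) - E(X)^2 = E(X^2)$. Again by elementary calculus, we have

    $$E(X^2) = \int_{-\Delta/2}^{\Delta/2} x^2 \, \frac{1}{\Delta} \, dx = \frac{1}{\Delta} \left[ \frac{1}{3} x^3 \right]_{x=-\Delta/2}^{x=\Delta/2} = \frac{1}{3\Delta} \left( \frac{\Delta^3}{8} - \frac{(-\Delta)^3}{8} \right) = \frac{\Delta^2}{12}.$$

    Hence the standard deviation of the quantization error $X$ is $\Delta / \sqrt{12} = \mathrm{LSB} / \sqrt{12}$.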
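
    The result can be checked with a short simulation. The sketch below quantizes uniformly drawn values by rounding to the nearest multiple of the quantization step and compares the empirical standard deviation of the error with the step divided by the square root of 12. The step size, the value range, the sample count and the seed are arbitrary assumptions.

```python
import math
import random


def quantization_errors(n_samples, lsb, seed=0):
    """Quantize random values to multiples of lsb and return the errors."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_samples):
        value = rng.uniform(0.0, 1000.0)        # hypothetical analog value
        quantized = round(value / lsb) * lsb    # nearest quantization level
        errors.append(quantized - value)
    return errors


lsb = 0.5
errors = quantization_errors(100_000, lsb)
mean_error = sum(errors) / len(errors)
std_error = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / len(errors))
expected_std = lsb / math.sqrt(12)
```

    The empirical values should be close to a mean of zero and a standard deviation of `expected_std`; the agreement is statistical, so small deviations are expected.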

    Linear systems

    A linear system satisfies the following properties (strictly speaking, linearity itself requires only the first two; shift invariance is an additional property that is assumed in what follows):

    Homogeneity: If an input signal of x[n] results in an output signal of y[n], then an input of k x[n] results in an output of k y[n], for any input signal x and constant k. This kind of operation (multiplication by a scalar) is also called scaling.

    Additivity: If an input of x1[n] produces an output of y1[n], and a different input, x2[n], produces another output, y2[n], then the system is said to be additive if an input of x1[n] + x2[n] results in an output of y1[n] + y2[n], for all possible input signals x1, x2. In words: signals added at the input produce signals that are added at the output.

    Shift invariance: A shift in the input signal will result in an identical shift in the output signal. In formal terms, if an input signal of x[n] results in an output of y[n], an input signal of x[n+s] results in an output of y[n+s], for any input signal and any constant s.
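
    The three properties can be tested on concrete systems. The Python sketch below uses a hypothetical smoothing system with integer weights (1, 2, 1) and, as a contrast, a max filter of width three. Signals are treated as circular so that shifts are exact; this and all names are assumptions made for the example.

```python
def smooth_121(x):
    """Hypothetical example system: circular smoothing with integer weights (1, 2, 1)."""
    n = len(x)
    return [x[(i - 1) % n] + 2 * x[i] + x[(i + 1) % n] for i in range(n)]


def max_filter(x):
    """Circular dilation with a window of three samples (a nonlinear system)."""
    n = len(x)
    return [max(x[(i - 1) % n], x[i], x[(i + 1) % n]) for i in range(n)]


def shift(x, s):
    """Circular shift: the result at position n is x[n + s]."""
    n = len(x)
    return [x[(i + s) % n] for i in range(n)]


def is_homogeneous(system, x, k):
    return system([k * v for v in x]) == [k * v for v in system(x)]


def is_additive(system, x1, x2):
    summed = [a + b for a, b in zip(x1, x2)]
    return system(summed) == [a + b for a, b in zip(system(x1), system(x2))]


def is_shift_invariant(system, x, s):
    return system(shift(x, s)) == shift(system(x), s)


x1 = [0, 1, 4, 1, 0, 0, 2, 0]
x2 = [3, 0, 0, 5, 0, 1, 0, 0]
```

    For these inputs, `smooth_121` passes all three checks, while `max_filter` is shift invariant but fails additivity and fails homogeneity for k = -1. Each helper only tests the given example, so a passing check does not prove the property in general, whereas a failing check does disprove it.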

    Linear systems (3)

    Synthesis, decomposition and superposition

    Synthesis: Signals can be combined by scaling (multiplication of the signals by constants) followed by addition. The process of combining signals through scaling and addition is called synthesis.

    For example, a mass spectrum can be composed out of baseline, a number of actual peaks, and white noise.

    Decomposition is the inverse operation of synthesis, where a single signal is broken into two or more additive components. This is more involved than synthesis, because there are infinitely many possible decompositions for any given signal.

    Linear systems (5)

    [Figure: superposition diagram with the labels Decomposition, System and Synthesis.]

    Linear systems (6)


    The trivial, but important observation here is:

    The output signal obtained by superposition of the components is identical to the one produced by directly passing the input signal through the system.

    Thus instead of trying to understand how complicated signals are changed by a system, all we need to know is how simple signals are modified.

    There are two main ways to decompose signals in signal processing: impulse decomposition and Fourier decomposition.

    Fourier methods

    Fourier decomposition

    The Fourier decomposition (named after Jean Baptiste Joseph Fourier (1768-1830), a French mathematician and physicist) uses cosine and sine functions as component signals:

    $$c_k[i] := \cos(2\pi k i / N) \quad\text{and}\quad s_k[i] := \sin(2\pi k i / N)$$

    where $k = 0, \dots, N/2$ and $N$ is the number of sample positions. (It turns out that $s_0 = s_{N/2} = 0$, so there are in fact only $N$ components.)
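
    The count of components can be verified numerically. The following Python sketch builds the cosine and sine component signals for a small even N and counts how many of them are not identically zero. The value N = 8, the tolerance and the helper names are assumptions for illustration.

```python
import math


def fourier_components(n):
    """Return the cosine and sine component signals for k = 0, ..., n/2 (n even)."""
    ks = range(n // 2 + 1)
    cos_waves = [[math.cos(2 * math.pi * k * i / n) for i in range(n)] for k in ks]
    sin_waves = [[math.sin(2 * math.pi * k * i / n) for i in range(n)] for k in ks]
    return cos_waves, sin_waves


def is_zero(signal, tol=1e-12):
    """True if all samples vanish up to floating-point rounding."""
    return all(abs(v) < tol for v in signal)


N = 8
cos_waves, sin_waves = fourier_components(N)
nonzero_count = sum(not is_zero(w) for w in cos_waves + sin_waves)
```

    For N = 8 there are 5 cosine and 5 sine signals, of which the two sine signals for k = 0 and k = N/2 vanish at every sample position, leaving N non-trivial components.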

    Fourier based methods perform best for periodic signals. They are less suitable for peak-like signals, e.g. mass spectra or elution profiles. We will not go into further details here. Fourier theory is important in the analysis of FTICR mass spectrometers (FTICR = Fourier transform ion cyclotron resonance) and Orbitrap mass spectrometers.

    Linear systems (2)

    Impulse Decomposition

    The impulse decomposition breaks an N-sample signal into N component signals, each containing N samples. Each of the component signals contains one point from the original signal, with the remainder of the values being zero. Such a single nonzero point in a string of zeros is called an impulse.

    The delta function is a normalized impulse. It is defined by

    $$\delta[n] := \begin{cases} 1, & n = 0, \\ 0, & \text{otherwise.} \end{cases}$$

    The impulse response of a linear system, usually denoted by h[n], is the output of

    the system when the input is a delta function.
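
    Impulse decomposition and the impulse response together determine the output of a linear, shift-invariant system: every impulse of height x[k] at position k contributes a copy of h that is scaled by x[k] and shifted by k, and the copies are added up. This is the standard convolution sum. The Python sketch below illustrates it; the impulse response `h` and the input are hypothetical.

```python
def impulse_decomposition(x):
    """Split x into len(x) component signals, each holding one sample of x."""
    n = len(x)
    return [[x[k] if i == k else 0 for i in range(n)] for k in range(n)]


def convolve(x, h):
    """Output of a linear shift-invariant system with impulse response h."""
    y = [0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):        # impulse of height x[k] at position k ...
        for j, hj in enumerate(h):    # ... yields a scaled, shifted copy of h
            y[k + j] += xk * hj
    return y


x = [2, 0, 1, 3]
h = [1, 2, 1]   # hypothetical impulse response

# Superposition: process each impulse component separately, then add the outputs.
component_outputs = [convolve(c, h) for c in impulse_decomposition(x)]
y_superposed = [sum(values) for values in zip(*component_outputs)]

# Direct processing of the whole input.
y_direct = convolve(x, h)
```

    Both routes give the same output, and feeding a delta function into `convolve` returns `h` itself, in line with the definition of the impulse response.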

    Linear systems (5)

    Moving average filter

    Gaussian filter


    Linear systems (7)

    And this shows the convolution with a Gaussian kernel.
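
    Both smoothing filters are convolutions that differ only in the kernel. A minimal Python sketch, assuming odd-length symmetric kernels that are normalized to sum to one and a border treatment that repeats the first and last sample; kernel sizes and names are illustrative choices.

```python
import math


def moving_average_kernel(width):
    """Box kernel: all weights equal, summing to one (width should be odd)."""
    return [1.0 / width] * width


def gaussian_kernel(sigma, radius):
    """Sampled Gaussian on -radius, ..., radius, normalized to sum to one."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(x, kernel):
    """Convolve x with an odd-length kernel; indices are clamped at the border."""
    r = len(kernel) // 2
    n = len(x)
    return [sum(kernel[j + r] * x[min(n - 1, max(0, i - j))] for j in range(-r, r + 1))
            for i in range(n)]


impulse = [0.0] * 5 + [1.0] + [0.0] * 5
gauss = gaussian_kernel(sigma=1.0, radius=3)
box = moving_average_kernel(5)
impulse_response_gauss = smooth(impulse, gauss)
impulse_response_box = smooth(impulse, box)
```

    Smoothing an impulse returns the kernel itself, centered on the impulse, which makes it easy to compare the rectangular shape of the moving average with the bell shape of the Gaussian.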

    Wavelets (2)

    To adjust the width of the Marr wavelet, one introduces a scaling parameter $a$. Thus the impulse response would be $\psi(x/a)/\sqrt{a}$.

    Wavelets (3)

    Using wavelets, we can solve two problems at the same time:

    The integral of a wavelet is zero. Therefore a constant baseline has no influence on the output.

    High frequency noise is also filtered out.
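
    The baseline claim can be illustrated with a sampled Marr wavelet. The sketch assumes the common unnormalized form (1 - x^2) exp(-x^2 / 2) for the wavelet and a width scaling with division by the square root of a; the scale, the kernel radius and the toy data are hypothetical.

```python
import math


def marr(x):
    """Unnormalized Marr ("Mexican hat") wavelet, assumed form (1 - x^2) * exp(-x^2 / 2)."""
    return (1.0 - x * x) * math.exp(-0.5 * x * x)


def marr_kernel(a, radius):
    """Scaled wavelet marr(x / a) / sqrt(a), sampled at x = -radius, ..., radius."""
    return [marr(i / a) / math.sqrt(a) for i in range(-radius, radius + 1)]


def apply_kernel(signal, kernel):
    """Filter with a symmetric kernel; only fully covered positions are returned."""
    r = len(kernel) // 2
    return [sum(k * signal[i + j - r] for j, k in enumerate(kernel))
            for i in range(r, len(signal) - r)]


kernel = marr_kernel(a=3.0, radius=30)

# Hypothetical toy data: one Gaussian peak, with and without a constant baseline of 100.
peak = [10.0 * math.exp(-0.5 * ((i - 60) / 3.0) ** 2) for i in range(121)]
with_baseline = [v + 100.0 for v in peak]

response_peak = apply_kernel(peak, kernel)
response_with_baseline = apply_kernel(with_baseline, kernel)
largest_difference = max(abs(a - b) for a, b in zip(response_peak, response_with_baseline))
```

    Since the sampled kernel sums to practically zero, the two responses agree up to rounding, and the response is largest where the kernel is centered on the peak. A baseline that is not constant but varies slowly is suppressed only approximately.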

    Wavelets (5)

    A comparison of filter kernels: Moving average, Gauss, Marr

    Averagines

    But how can we know the isotopic pattern to use for the wavelet?

    If one plots the atomic content of proteins in some protein database (e.g. SwissProt), it becomes evident that the number of atoms of each type grows roughly linearly with the molecular weight.

    The picture shows on the x-axis the molecular weight and on the y-axis the number

    of atoms of a type.

    [Plot: number of atoms (y-axis, 0 to 300) against molecular weight (x-axis, 0 to 5000); legend: "C stats", "C mean", "N stats", "N mean".]

    Averagines (2)

    Since the number of C, N, and O atoms grows about linearly with the mass of the molecule, it is clear that the isotope pattern changes with mass.

    mass   [0]    [1]    [2]    [3]    [4]
    1000   0.55   0.30   0.10   0.02   0.00
    2000   0.30   0.33   0.21   0.09   0.03
    3000   0.17   0.28   0.25   0.15   0.08
    4000   0.09   0.20   0.24   0.19   0.12

    Since there is a very nice linear relationship between peptide mass and its atomic composition, we can estimate the average composition for a peptide of a given mass.

    Given the atomic composition of a peptide, we can compute the relative intensities of its peaks in a mass spectrum, the isotopic pattern.

    We can use this knowledge for feature detection, i.e. to summarize isotopic patterns into peptide features and to separate them from noise peaks.
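
    Both steps can be sketched in a few lines of Python: estimate an average composition for a given mass, then convolve the isotope distributions of all atoms. The averagine unit (C4.9384 H7.7583 N1.3577 O1.4773 S0.0417, average mass 111.1254 u) and the isotope abundances are commonly quoted approximate values and are not given in the slides, so they are assumptions here, as are the rounding to whole atoms and all function names.

```python
# Assumed averagine unit and its average mass in u (not from the slides).
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254

# Approximate natural isotope abundances, indexed by the number of extra neutrons.
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def averagine_composition(mass):
    """Estimated whole-number atom counts of an average peptide of the given mass."""
    units = mass / AVERAGINE_MASS
    return {element: round(count * units) for element, count in AVERAGINE.items()}


def convolve(p, q, max_len):
    """Convolution of two distributions, truncated to the first max_len entries."""
    out = [0.0] * min(max_len, len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j < len(out):
                out[i + j] += pi * qj
    return out


def isotope_pattern(composition, n_peaks=5):
    """Relative intensities of the first n_peaks isotopic peaks."""
    pattern = [1.0]
    for element, count in composition.items():
        for _ in range(count):
            pattern = convolve(pattern, ISOTOPES[element], n_peaks)
    return pattern


pattern_1000 = isotope_pattern(averagine_composition(1000.0))
pattern_4000 = isotope_pattern(averagine_composition(4000.0))
```

    Under these assumptions the patterns should come out close to the corresponding rows of the table, with the monoisotopic peak dominating at mass 1000 and the peak [2] dominating at mass 4000. Exact agreement is not expected, since the abundances and the rounding used for the table are not known.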
