Analysis of Mass Spectrometry Data: Problems and Tools

Johan Carlson
[email protected]

Div. of Systems and Interaction
Dept. of Computer Science, Electrical and Space Engineering
Luleå University of Technology
SE-971 87 Luleå
Sweden


Today’s menu

Background
  Mass spectrometry
  Traditional multivariate data analysis
  Problems

Tools
  Pre-processing
  Traditional analysis, re-visited
  Problems?
  Alternative analysis strategy

Future challenges


Mass spectrometry (MS)

Analytical technique that measures the mass-to-charge ratio of charged particles.

Used for:
  Determining the masses of particles.
  Determining the elemental composition of a sample or molecule.
  Revealing the chemical structures of molecules and compounds.

Works by ionizing chemical compounds to generate charged molecules or molecule fragments and measuring their mass-to-charge ratios.


Mass spectrometry (MS)

A sample is loaded onto the MS instrument and undergoes vaporization.

The components of the sample are ionized by one of a variety of methods (e.g., by impacting them with an electron beam), which results in the formation of charged particles (ions).

The ions are separated according to their mass-to-charge ratio in an analyzer by electromagnetic fields.

The ions are detected, usually by a quantitative method.

The ion signal is processed into mass spectra.


Mass spectrometry (MS)

(Source: wikipedia.org)


Mass spectrometry (MS)

The output:
  A vector of mass-to-charge values, i.e. the locations of the peaks in the mass spectrum.
  Abundance values, i.e. the peaks themselves, representing the relative abundance ("amount") of each ion in the sample.

This has some implications (causing problems!), but let’s leave these for now.

The locations of the peaks (i.e. the corresponding mass values) give information about which types of molecules are present.

The magnitudes of the peaks give information about the relative amount of each molecule.
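As a rough illustration of this output format, one spectrum can be held as two equal-length arrays. The class name `Spectrum`, its helper methods and the numbers below are made up for illustration; they are not taken from the talk or from any instrument software.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Spectrum:
    """One mass spectrum as a peak list: peak locations and peak heights."""

    mz: np.ndarray         # mass-to-charge values of the peaks [Da]
    abundance: np.ndarray  # abundance of each peak (arbitrary units)

    def relative_abundance(self) -> np.ndarray:
        """Abundances scaled so that they sum to 1."""
        return self.abundance / self.abundance.sum()

    def base_peak(self) -> float:
        """m/z value of the most abundant ion."""
        return float(self.mz[np.argmax(self.abundance)])


# Hypothetical values, for illustration only.
spectrum = Spectrum(
    mz=np.array([198.2, 212.2, 226.2]),
    abundance=np.array([30.0, 50.0, 20.0]),
)
```

Note that the length of the two arrays depends on how many peaks were found, so two spectra generally do not have the same length.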

The next slide shows an example of a mass spectrum for a reasonably simple peptide.


Mass spectrometry (MS)

[Figure: example mass spectrum of a simple peptide. Source: wikipedia.org]


Mass spectrometry (MS)

For more complex mixtures, the mass spectra become more difficult to interpret.

The following example of a mass spectrum of a crude oil sample is taken from:
J. E. Carlson, J. R. Gasson, T. Barth, and I. Eide, "Extracting Homologous Series from Mass Spectrometry Data by Projection on Predefined Vectors", Chemom. Intell. Lab. Syst., Vol. 114, pp. 36–43, 2012.


Mass spectrometry (MS)

[Figure: mass spectrum of a crude oil sample. x-axis: m/z [Da], 100 to 600; y-axis: normalised abundance, 0 to 50; labelled peaks at 198.2, 212.2 and 226.2.]


Traditional multivariate analysis

Purpose: Reveal underlying patterns in large data sets.

Example: Look at a set of mass spectra from 10 different oil samples. How are these different?

Tool of choice (among chemists): Principal Component Analysis (PCA).

So, let’s first look at what PCA is!


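Principal component analysis (PCA)

Assume we make an observation $\mathbf{x}_m$, where

$$\mathbf{x}_m = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}^T,$$

where $x_1, x_2, \ldots, x_N$ are measured quantities for $N$ different variables.

If we have $M$ such multivariate observations, we can store these in a matrix $X$ as

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_M^T \end{bmatrix}$$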


Principal component analysis (PCA)

Now let’s factor $X$ as

$$X = T P^T,$$

where the columns of $P$ are the normalized eigenvectors of $X^T X$, i.e. a new basis for the row space of $X$ constructed from the eigenvectors of the sample covariance matrix of our $M$ observations (in $N$ variables; for mean-centred $X$, the sample covariance matrix is proportional to $X^T X$). The rows of $T$ are then the coordinates in this new basis.

Furthermore, let the eigenvectors be sorted so that the first column of $P$ is the eigenvector corresponding to the largest eigenvalue, the second column corresponds to the second largest eigenvalue, and so on.
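The factorization can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the observations are stored as rows and that the data are mean-centred first, so that the N-by-N matrix `Xc.T @ Xc` is proportional to the sample covariance matrix. The function name `pca` and the random test data are placeholders, not part of the talk.

```python
import numpy as np


def pca(X):
    """PCA of X (M observations as rows, N variables as columns).

    Returns scores T, loadings P and eigenvalues, sorted so that the
    first column of P belongs to the largest eigenvalue.
    """
    Xc = X - X.mean(axis=0)                 # mean-centre each variable
    eigvals, P = np.linalg.eigh(Xc.T @ Xc)  # symmetric matrix, so eigh applies
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    eigvals, P = eigvals[order], P[:, order]
    T = Xc @ P                              # coordinates in the new basis
    return T, P, eigvals


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # synthetic data, for illustration only
T, P, eigvals = pca(X)
```

Because the columns of `P` are orthonormal, `T @ P.T` reproduces the centred data exactly when all components are kept.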

WHY IS THIS GOOD?


Principal component analysis (PCA)

Example

Let’s look at a two-dimensional case, where $x_1$ and $x_2$ are observations from a two-dimensional Gaussian random variable with covariance matrix

$$R = \begin{bmatrix} 10 & 1.5 \\ 1.5 & 0.5 \end{bmatrix}$$
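To get a feel for this example, the eigenvalues of $R$ can be computed directly and a set of observations can be drawn. This is only a sketch: the sample size and the random seed are arbitrary choices, not values from the talk.

```python
import numpy as np

R = np.array([[10.0, 1.5],
              [1.5, 0.5]])  # covariance matrix from the example

eigvals, eigvecs = np.linalg.eigh(R)               # ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first

explained = eigvals / eigvals.sum()  # share of the total variance per component

# Draw observations with this covariance (size and seed are arbitrary).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=1000)
```

For this $R$ the first eigenvector carries roughly 97% of the total variance, so most of the spread in the data lies along a single direction.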


Principal component analysis (PCA)

[Figure, shown over two slides: scatter plot of the observations, x2 (−2 to 2) versus x1 (−10 to 10).]


Principal component analysis (PCA)

In essence, PCA is a rotation of the coordinate system.

The axes of the new system describe directions in which we have large experimental variation.

If there are strong correlations in the original data, we can therefore reduce the dimensionality by discarding PCs, with minimum loss of information (actually optimal, in the least-squares sense).
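The least-squares statement can be checked numerically: when only the first k PCs are kept, the squared reconstruction error equals the sum of the discarded eigenvalues of the centred cross-product matrix. The sketch below uses synthetic, strongly correlated two-dimensional data; the covariance, sample size, seed and the choice k = 1 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two strongly correlated variables (synthetic, for illustration only).
X = rng.multivariate_normal([0.0, 0.0], [[10.0, 1.5], [1.5, 0.5]], size=500)
Xc = X - X.mean(axis=0)

eigvals, P = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

k = 1                                 # number of PCs to keep
T_k = Xc @ P[:, :k]                   # scores on the first k PCs
X_hat = T_k @ P[:, :k].T              # rank-k reconstruction of the data
sq_error = np.sum((Xc - X_hat) ** 2)  # squared reconstruction error
```

Here one coordinate per observation is stored instead of two, and only a small fraction of the total variation is lost.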


Principal component analysis (PCA)

[Figure: the observations plotted in the principal-component coordinates, p2 (−1 to 1) versus p1 (−10 to 10).]


Problems with mass spectrometry data

Example

Assume we have mass spectra of 10 different crude oils (five replicates of each).

A PCA should be able to reveal differences between these.

So, let’s store mass spectra from the samples as rows of our matrix $X$ (columns then represent mass/charge values).

Large variations between oil samples should show up, and similar oils should group together.

Let’s try it!


Problems with mass spectrometry data

[Figure: PCA score plot of the oil spectra, p2 (−50 to 50) versus p1 (−100 to 40).]


Problems with mass spectrometry data

It doesn’t work! Why?

Problem no. 1:
In PCA, we assume all columns represent different variables, but that these variables are the same for all rows.

The MS data are non-uniformly sampled, meaning that we obtain pairs of mass/charge values and abundance values only where there are peaks.

So, when all data are stored in one matrix, a given column does not represent the same thing for the different spectra.

Problem no. 2:
Uncertainties in the instrument cause peak locations to shift slightly.

So, even for replicate experiments, the mass/charge values will not be the same.

Is PCA doomed?


Pre-processing of MS data

It appears as if we need to do some pre-processing of the spectra before PCA can be applied.

1. Re-sampling of mass spectra so that they share one common mass/charge vector.

2. Taking the uncertainty of the instrument into account, i.e. aligning peaks from different spectra that can be assumed to have the same mass/charge value.
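One possible way to carry out both steps is sketched below. It is not the procedure used for the results in this talk, only a simple stand-in: all peaks are pooled and sorted, and a new group is started wherever the gap to the previous peak exceeds a tolerance. The function name `align_spectra`, the tolerance of 0.1 Da and the two replicate peak lists are assumptions made for illustration. A chain of closely spaced peaks would be merged into one group by this greedy rule, so it only suits well-separated peaks.

```python
import numpy as np


def align_spectra(spectra, tol=0.1):
    """Put peak lists on one common m/z vector.

    spectra: list of (mz, abundance) array pairs, one pair per spectrum.
    tol: peaks closer than this [Da] are treated as the same ion (assumed).
    Returns (common_mz, X) with one row of X per spectrum.
    """
    all_mz = np.concatenate([mz for mz, _ in spectra])
    all_ab = np.concatenate([ab for _, ab in spectra])
    owner = np.concatenate(
        [np.full(len(mz), i) for i, (mz, _) in enumerate(spectra)]
    )

    order = np.argsort(all_mz)
    all_mz, all_ab, owner = all_mz[order], all_ab[order], owner[order]

    # Start a new group wherever the gap to the previous peak exceeds tol.
    group = np.concatenate([[0], np.cumsum(np.diff(all_mz) > tol)])
    n_groups = group[-1] + 1

    # Common m/z value of a group: mean of the pooled peak locations.
    common_mz = np.bincount(group, weights=all_mz) / np.bincount(group)

    X = np.zeros((len(spectra), n_groups))
    np.add.at(X, (owner, group), all_ab)  # missing peaks stay at zero
    return common_mz, X


# Two hypothetical replicates with slightly shifted peak locations.
replicate_a = (np.array([198.20, 212.21]), np.array([30.0, 50.0]))
replicate_b = (np.array([198.23, 212.18, 226.20]), np.array([28.0, 52.0, 20.0]))
common_mz, X = align_spectra([replicate_a, replicate_b])
```

After this step every column of `X` refers to the same mass/charge value for all spectra, which is what PCA assumes.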

So, if we do this (takes a bit of programming...), what do we get?


Traditional PCA, re-visited

[Figure: PCA score plot after pre-processing, p2 (−70 to 30) versus p1 (−60 to 80).]


Traditional PCA, re-visited

So, it appears we have overcome the main problems. Now:

Replicate experiments on the same oils group together.

Oils with different chemical compositions are separated.

Oils with different, but somewhat similar properties, appear closer to each other in the plot.

Remaining problem:

The underlying chemical properties are hard to find. The new representation reveals patterns, but these are hard to interpret.


Alternative analysis strategy

Let’s go back to the example mass spectrum:

[Figure: mass spectrum of a crude oil sample. x-axis: m/z [Da], 100 to 600; y-axis: normalised abundance, 0 to 50; labelled peaks at 198.2, 212.2 and 226.2.]


Alternative analysis strategy

Observation

It appears there are series of peaks, separated by a fixed mass/charge value.

A separation of $n \times 14$ would mean the molecule has $n$ extra CH$_2$ groups.
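The spacing rule can be written as a small check on a pair of peaks. This is a sketch only: the function name `ch2_steps` and the tolerance of 0.1 Da are assumptions, and the nominal spacing of 14 is used as on the slide. The monoisotopic mass of CH$_2$ is closer to 14.0157 Da, so for long series the spacing would need to be set accordingly.

```python
def ch2_steps(mz_a, mz_b, spacing=14.0, tol=0.1):
    """Return n if the two peaks differ by n * spacing within tol, else None."""
    diff = abs(mz_b - mz_a)
    n = round(diff / spacing)  # nearest whole number of CH2 groups
    if n >= 1 and abs(diff - n * spacing) <= tol:
        return n
    return None


# Peak locations as labelled in the example spectrum.
one_step = ch2_steps(198.2, 212.2)   # peaks one CH2 group apart
two_steps = ch2_steps(198.2, 226.2)  # peaks two CH2 groups apart
unrelated = ch2_steps(198.2, 205.0)  # not a multiple of the spacing
```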

Idea

Could we analyze the spectra in terms of series like these instead of eigenvectors of the covariance matrix (PCA)?

How would we take uncertainties of the instrument into account when doing this?


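Alternative analysis strategy

Let’s construct a new orthonormal basis for the spectra.

[Figure: basis vectors u1, u2 and u3 plotted against m/z [Da]. Each basis vector consists of peaks of a fixed peak width; the m/z axis is marked relative to a reference mass m, including m + 14.]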


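Alternative analysis strategy

We can now project our spectra onto this new basis, by

$$T = U^T X,$$

where the columns of $U$ are the basis vectors from the previous slide, the spectra are stored as the columns of $X$ (with spectra stored as rows, as in the PCA slides, the equivalent form is $T = XU$), and $T$ contains the scores obtained by the projection, i.e. "how much of each basis function is present in each of the spectra".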
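A minimal sketch of building such a basis and projecting onto it is given below. Several things are assumed here rather than taken from the talk: the basis vectors are modelled as rectangular windows on a common m/z grid with 0.1 Da steps, the window half-width is set to 0.15 Da so that grid points within ±0.1 Da fall inside, the spacing is the nominal 14, and the two test spectra are synthetic. The function name `comb_basis` is hypothetical. The start masses 192.1, 198.1 and 190.1 are the first peak locations of the three basis vectors shown later in the talk.

```python
import numpy as np


def comb_basis(mz_grid, starts, spacing=14.0, half_width=0.15):
    """One unit-norm column per start mass, with windows at start + n * spacing."""
    U = np.zeros((len(mz_grid), len(starts)))
    for j, start in enumerate(starts):
        # Index of the series member nearest to each grid point.
        n = np.round((mz_grid - start) / spacing)
        inside = (n >= 0) & (np.abs(mz_grid - start - n * spacing) <= half_width)
        U[inside, j] = 1.0
        U[:, j] /= np.linalg.norm(U[:, j])  # normalise to unit length
    return U


mz_grid = np.round(np.linspace(190.0, 229.9, 400), 1)  # common m/z vector
starts = [192.1, 198.1, 190.1]                         # first peaks of u1, u2, u3
U = comb_basis(mz_grid, starts)

# Two synthetic spectra stored as the columns of X.
X = np.zeros((len(mz_grid), 2))
X[np.isclose(mz_grid, 192.1), 0] = 5.0
X[np.isclose(mz_grid, 206.1), 0] = 3.0
X[np.isclose(mz_grid, 198.1), 1] = 4.0

T = U.T @ X  # scores: one row per basis vector, one column per spectrum
```

The columns of `U` are orthonormal only because the windows of different series do not overlap; with start masses that are closer together than the window width, an explicit orthogonalisation step would be needed.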


Alternative analysis strategy

[Figure: panels (a) to (d) showing plots of the scores t1, t2 and t3 for the samples F01o, F02t, F03t, F04t, F05t, F06o, F07o, F08t, F09t and F10t.]


Alternative analysis strategy

Observations

Replicates of the same oil group nicely.

Oils with similar chemical composition appear close to each other.

Chemically different oils will be separated.

So far we can see the same things as with PCA. So, what else?

Let’s look at the corresponding basis vectors.


Alternative analysis strategy

[Figure: basis vectors u1, u2 and u3 plotted against m/z [Da] (190 to 230), amplitude 0 to 0.2.
u1: peaks at 192.1 ± 0.1, 206.1 ± 0.1 and 220.1 ± 0.1.
u2: peaks at 198.1 ± 0.1, 212.1 ± 0.1 and 226.1 ± 0.1.
u3: peaks at 190.1 ± 0.1, 204.1 ± 0.1 and 218.1 ± 0.1.]


Alternative analysis strategy

Observations

Looking at the mass/charge values of the peaks, a trained chemist can determine which chemical compound class these sequences correspond to.

In other words, in addition to being able to discriminate between chemically different oil samples, we can also interpret which types of chemical compounds cause the difference.


Future challenges

How can we model the original spectra based on this new analysis method?

How can we model the way a change in a process variable (in the preparation of the oil) affects the composition?

We still need to develop various diagnostic and visualization tools to aid the chemist in the analysis of the results.


Thank you!