Metabolomics Data Analysis

SeqAhead, Barcelona February 2013

Johan A. Westerhuis

Swammerdam Institute for Life Sciences, University of Amsterdam

Business Mathematics and Informatics,

North-West University, Potchefstroom, South Africa


http://www.nbic.nl/


Metabolomics pipeline: Issues for biostatistics

Pipeline steps:

• Biological question

• Experimental design

• Data acquisition

• Data pre-processing

• Statistical data analysis

• Metabolite identification

• Biological interpretation

Issues:

• Power analysis, Treatment design

• Explorative, Predictive, Hypothetical biomarkers

• QC strategy, Measurement design

• Normalisation, Quantification

• Spectral matching, De novo identification

• Network inference, MSEA, Pathway analysis

Data Analysis special issue Metabolomics

• Data preprocessing methods (make samples more comparable)

• How to treat non-detects

• Variable importance in multivariate models

• Metabolic network analysis

• Data fusion methods

• Individual responses

• Between metabolite ratios

Guest Editors Jeroen J. Jansen Johan A. Westerhuis

Multivariate metabolomics data

Technical correlations Biological correlations

Biological correlations

TARGETED ANALYSIS

hipp  fum  urea  allant  TMAO  citrat
3     1    8     7       13    4
5     2    12    6       15    2

NONTARGETED PROFILING

1  67  45  6  3   31  10  44  32  10
3  24  12  4  33  23  0   0   99  76

Multivariate Metabolomics Data analysis

• Explorative – Find groups, cluster structure / outliers in metabolites and in samples

• Supervised – Discriminate two or more groups to make a predictive model and to find biomarkers.

• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference

• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion

Metabolomics Data preprocessing

• Optimize biological content of data

• Correct for incorrect sampling, sample workup issues, batch effects

• What is the noise level in the data? Variance stabilization.

• High peaks more important than low peaks?

• Multivariate methods love large values!

Generalized log transform
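A minimal sketch of a generalized log transform, assuming the common form glog(x) = ln((x + sqrt(x^2 + lambda)) / 2). The tuning parameter `lam` and the example peak values are hypothetical and not taken from the slides. For large peaks the transform behaves like ln(x); near zero it stays finite, which is what stabilizes the variance of low peaks.

```python
import math

def glog(x, lam=1.0):
    """Generalized log: ln((x + sqrt(x^2 + lam)) / 2).

    lam is an assumed tuning parameter; in practice it is chosen
    from the noise level of the data.
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical peak intensities, from a non-detect (0) to a high peak.
peaks = [0.0, 0.5, 10.0, 1000.0]
transformed = [glog(p) for p in peaks]
```

Unlike a plain log, the transform accepts zero values, so non-detects do not need to be replaced before transformation.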

Metabolic changes during E. coli culture growth using k-means clustering.

Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10

(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of the respective sampling time points are marked on the curve. Time point 0 minutes marks the application of the respective stress condition.

(B) Relative changes of metabolite pools, normalized to time point 1. Fold change is presented on a log10 scale. To reveal the main trends of metabolic changes, 10 k-means clusters are color coded.

[Figure: metabolites versus time]

Self Organising Map of Metabolites in serum

• 1H NMR spectra of 613 patients with type I diabetes and a diverse spread of complications.

• Nonlinear mapping method for a large number of samples.

• Relate position on the map to diagnostic responses.

• Can be made supervised.

1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death. VP Mäkinen et al., Molecular Systems Biology 4:167, 2008

http://www.nature.com/msb/journal/v4/n1/fig_tab/msb4100205_F4.html




• Supervised (Differentially expressed) – Discriminate two or more groups to make a predictive model and to find biomarkers.



Supervised Metabolomics Data analysis: Case – Control (PLSDA)

Class labels: Y = (0, 0, 0, 1, 1, 1)

• Is there really a difference between the groups? Statistical validation issues

• Which are the most important peaks for discrimination? Variable importance
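One way to approach the question "is there really a difference between the groups?" is a permutation test on the class labels. The sketch below is a simplified stand-in: it permutes labels for a single peak and uses the difference in group means as the statistic. For a PLSDA model one would instead refit the cross-validated model on every permuted Y. The function name, the data and the number of permutations are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Permutation p-value for the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of class labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one so the observed labelling counts as one permutation.
    return (hits + 1) / (n_perm + 1)

# Hypothetical intensities of one peak in controls (Y = 0) and cases (Y = 1).
controls = [1.0, 1.2, 0.9, 1.1, 1.0, 0.95]
cases = [2.0, 2.2, 1.9, 2.1, 2.05, 1.95]
p_value = permutation_p_value(controls, cases)
```

A small p-value says that the observed separation is rarely matched when the labels carry no information.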

[Figure: PCA score plot, PC1 versus PC2, Men and Women]

[Figure: PLS regression coefficients (bPLS) versus chemical shift (ppm)]

• Explain the Psychogios example using examples from the paper and from MetaboAnalyst

Proton NMR spectra of the urine samples were obtained on a 500 MHz 1H NMR machine.

NMR spectra of urine samples

Nonsupervised

Supervised

Experimental Design Example

Experiment: Rats are given bromobenzene (BB), which affects the liver

Measurements: NMR spectroscopy of urine

Experimental Design:

• Time: 6, 24 and 48 hours

• Groups: 3 doses of BB, vehicle group, control group

• Animals: 3 rats per dose per time point

[Figure: NMR spectra of rat urine at 6, 24 and 48 hours; chemical shift axis 0 to 10 ppm; labelled peaks at 2.055, 2.075, 2.7175, 2.93, 3.0275, 3.0475, 3.285, 3.675, 3.7525 and 5.38 ppm]

Different contributions

[Figure: metabolite concentration versus time (0 to 1) in four panels: Time, Animal, Dose, Trajectories]

Experimental Design

ANOVA decomposition of each variable:

$$x_{hki_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hki_{hk}}$$

MATRICES:

$$X = 1m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$

[Figure: four panels of metabolite concentration versus time (0 to 1), y-axis from -0.5 to 4]
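A minimal sketch of the ANOVA decomposition of one variable, x = mu + alpha_k + (alpha beta)_hk + (alpha beta gamma)_hki, for a balanced design. It assumes that k indexes time, h indexes dose and i indexes the animal within a dose-time cell; this reading of the indices, the function name and the data are assumptions, not taken from the slides.

```python
from statistics import mean

def anova_decompose(data):
    """Split one variable into mu, alpha (time), alpha_beta (dose x time)
    and per-animal residuals. data maps (dose, time) -> list of values."""
    all_values = [x for cell in data.values() for x in cell]
    mu = mean(all_values)
    times = sorted({k for (_, k) in data})
    # Time effect: mean over all animals at time k, minus the overall mean.
    alpha = {
        k: mean([x for (_, kk), cell in data.items() if kk == k for x in cell]) - mu
        for k in times
    }
    # Dose-by-time effect: cell mean minus overall mean and time effect.
    alpha_beta = {(h, k): mean(cell) - mu - alpha[k] for (h, k), cell in data.items()}
    # Individual effect: deviation of each animal from its cell mean.
    residual = {(h, k): [x - mean(cell) for x in cell] for (h, k), cell in data.items()}
    return mu, alpha, alpha_beta, residual

# Hypothetical data: 2 doses, 2 time points, 3 animals per cell.
data = {
    ("low", 6): [1.0, 1.2, 0.8],
    ("low", 24): [2.0, 2.1, 1.9],
    ("high", 6): [1.5, 1.4, 1.6],
    ("high", 24): [3.0, 3.2, 2.8],
}
mu, alpha, alpha_beta, residual = anova_decompose(data)
```

Applying this to every variable and stacking the results column by column gives the matrices X_alpha, X_alpha-beta and X_alpha-beta-gamma.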

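ANOVA and PCA -> ASCA

$$X = 1m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$

$$X = 1m^{T} + T_{\alpha}P_{\alpha}^{T} + T_{\alpha\beta}P_{\alpha\beta}^{T} + T_{\alpha\beta\gamma}P_{\alpha\beta\gamma}^{T} + E$$

E: Parts of the data not explained by the component models

[Figure: X decomposed into scores Tα, Tαβ, Tαβγ and loadings Pα, Pαβ, Pαβγ]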

Results

[Figure: variation in X split over the submodels Xα, Xαβ and Xαβγ; one label reads 40 %]

[Figure: αβ-scores versus Time (Hours: 6, 24, 48) for the control, vehicle, low, medium and high groups]

Results biomarkers

[Figure: biomarker peaks of the α, αβ and αβγ submodels on the chemical shift axis (0 to 10 ppm); labelled peaks include 2.055, 2.075, 2.5425, 2.5825, 2.6975, 2.735, 2.91, 2.93, 3.0275, 3.0475, 3.2625, 3.285, 3.675, 3.73, 3.7525, 3.8875, 3.9675 and 5.38 ppm]

Differences between submodels

Interesting for Biology

Interesting for Statistics / Diagnostics

Unique to the α submodel






– Method comparison



SELDI measurements of serum samples of 20 Gaucher patients and 20 healthy controls. Gaucher is a genetic disease in which a fatty substance (lipid) accumulates in cells and certain organs.

NONTARGETED

• human urine and porcine cerebrospinal fluid samples spiked with a range of peptides.

• Variation in #samples, within and between group variation

Gaucher Spiked

Feature selection methods RESULTS

• Complex nontargeted Gaucher profiling data with highly variable background and varying difference between case and control: Multivariate methods perform best.

• Spiked LCMS targeted data with less variation in effect size: univariate and semi-univariate methods are best in selecting biomarkers.








Biomarkers: A: Univariate B: Multivariate C: Change in group correlation

BMR (between metabolite ratios) of a green tea intervention study: 186 human subjects with abdominal obesity

Validation shows significant changes in BMR between placebo and green tea treatment together with most important triacylglycerols TG28-29 and TG41-42.
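A minimal sketch of how between metabolite ratios can be computed for one sample, assuming positive concentrations and taking the log of each pairwise ratio. The metabolite names and values are hypothetical and not from the green tea study.

```python
import math
from itertools import combinations

def log_ratios(sample):
    """Return log(a / b) for every unordered pair of metabolites.

    sample maps metabolite name -> positive concentration.
    """
    return {
        (a, b): math.log(sample[a] / sample[b])
        for a, b in combinations(sorted(sample), 2)
    }

# Hypothetical concentrations for one subject.
sample = {"m1": 2.0, "m2": 4.0, "m3": 8.0}
ratios = log_ratios(sample)
```

With J metabolites this gives J(J-1)/2 ratios per sample, so the number of candidate biomarkers grows quickly and the validation step becomes more important.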









Plasma

Differences in blood metabolites due to aging

Aging biomarker metabolites in liver









Special topic: Metabolic networks Biochemical Network vs Association Network

Margriet M.W.B. Hendriks et al., Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 2011

Figure 7 Marginal correlation network for a set of metabolites in tomato. Volatiles in red, derivatized metabolites in yellow. Solid lines represent positive correlations, dashed lines negative ones. Thickness of line corresponds to magnitude of ...

Metabolomics, 2005

Metabolic neighbors Do not participate in common reactions

High correlation due to e.g. chemical equilibrium, mass conservation,..

Data from Potato tubers

“a systematic relationship between observed correlation networks and the underlying biochemical pathways.” Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003

[Figure: two network diagrams over metabolites A to F, with "??" between them]

Search for the link between metabolome data and underlying metabolic networks.

Metabolic Network Inference

As an example: can we distinguish healthy from diseased networks?

[Figure: HEALTHY and DISEASE networks over Glucose and metabolites A to G]

Goal: ?

Problems:

• NOISE

• MISSING METABOLITES

• NETWORK TOPOLOGY

• DIRECTIONS

• HUGE AMOUNT OF POSSIBLE NETWORK STRUCTURES

From data to network

1. DATA COLLECTION

2. SIMILARITY SCORE CALCULATION

ALL POSSIBLE PAIRWISE INTERACTIONS

2a. Relevance Networks

Pearson Correlation (PC) (linear)

Mutual Information (MI) (non-linear)

2b. Conditioned Networks

Partial Pearson Correlation (PPC) (linear)

Conditional Mutual Information (CMI) (non-linear)

[Figure: example networks over metabolites A to F, and simulated data panels: A. Enzymatic Variability, B. Intrinsic Variability, C. Environmental Variability]

Inference from static data

[Figure: real pathway (1. ASPP, 2. ASA, 3. HS, 4. HSP) and the networks inferred by PC, PPC, MI and CMI under Vmax variability, intrinsic variability and environmental variability. Legend: 100%, > 90%, 10% … 90%, < 10%]

PC: Pearson Correlation (linear measure)

MI: Entropy-based Mutual Information (non-linear measure)

PPC: Partial Pearson Correlation (linear conditioning measure)

CMI: Conditional Mutual Information (nonlinear conditioning measure)

Cakir, Metabolomics 2009

ESTIMATION OF CORRELATION NETWORKS
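A minimal sketch of the two linear similarity scores: Pearson correlation (relevance network) and first-order partial Pearson correlation (conditioned network). The data are a hypothetical chain A -> B -> C, built so that A and C correlate only through B; the noise values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """Correlation between x and y after conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical chain A -> B -> C with small deterministic "noise".
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e1 = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
e2 = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2]
b = [ai + ei for ai, ei in zip(a, e1)]
c = [bi + ei for bi, ei in zip(b, e2)]

r_ac = pearson(a, c)                  # high: indirect link shows up
r_ac_given_b = partial_pearson(a, c, b)  # near zero: link vanishes given B
```

In a relevance network the pair A, C would get an edge; in the conditioned network that edge drops out, which is the motivation for conditioning.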









• Account for between-block difference in quality of measurements to improve data fusion

• For example, multi-platform data fusion, with differences in quantification, (non-)targeted analysis, and error structure

• How to quantify the quality of measurements with many metabolites, and many samples?

Amino acids

Lipids

Fused data

Metabolomics data fusion

Error model for 1 metabolite

• Error models:

- RSD using 1 QC sample

- 2-component using study samples

• Good error description

- sufficient # samples

- large I-range

[Figure: standard deviation (St.D) versus mean intensity I; QC sample -> RSD; study samples]
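A minimal sketch of the two error descriptions for one metabolite. It assumes that the 2-component model has an additive and a proportional part, sd^2 = a^2 + (b * I)^2, which the slides do not spell out; the function names and all numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def rsd(qc_values):
    """Relative standard deviation from repeated QC measurements."""
    return stdev(qc_values) / mean(qc_values)

def fit_two_component(mean_intensity, sd):
    """Least-squares fit of sd^2 = a^2 + (b * I)^2.

    Returns (a, b): the additive sd and the proportional (relative) sd.
    """
    x = [i * i for i in mean_intensity]
    y = [s * s for s in sd]
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    intercept = my - slope * mx
    # Clip at zero so the square roots stay real for noisy data.
    return math.sqrt(max(intercept, 0.0)), math.sqrt(max(slope, 0.0))

# Hypothetical QC replicates and study-sample (mean intensity, sd) pairs.
qc_rsd = rsd([9.0, 10.0, 11.0])
intensities = [1.0, 5.0, 10.0, 50.0, 100.0]
sds = [math.sqrt(0.25 + 0.01 * i * i) for i in intensities]
add_sd, prop_rsd = fit_two_component(intensities, sds)
```

The fit only separates the two components when the intensities span a wide range, which matches the "large range" requirement on the slide.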

Figure of merit for data from 1 platform

Median: F-50 = 0.1

90th-percentile: F-90 = 0.35

[Figure: distribution of the number of peaks over the error estimate, with F-50 and F-90 marked; example variables Var. 15, Var. 118, Var. 213 and Var. 365]

(Van Batenburg et al. Analytical Chemistry, 2011)

GC/MS, LC/MS: J1 = 82 peaks, J2 = 49 peaks

Two-step data fusion: MB-MLPCA

• Step 1: Compute figures of merit for each platform

• Step 2: Multi-block PCA with weighting by figures of merit

[Figure: blocks X1 and X2 (Amino acids, Lipids) combined using the fused error covariance built from the estimated error variances $\hat{s}_j^2$]
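A simplified sketch of the weighting idea in step 2: each block is divided by its figure of merit before the blocks are joined, so the noisier platform contributes less to a subsequent PCA. This is plain block scaling, not the MB-MLPCA method itself, which works with the full error covariance. The function name, block values and figures of merit are hypothetical.

```python
def fuse_blocks(blocks, merits):
    """Scale each block by 1 / figure of merit and join the columns.

    blocks: list of matrices (lists of rows) measured on the same samples.
    merits: one error figure of merit per block (larger means noisier).
    """
    n_samples = len(blocks[0])
    fused = []
    for i in range(n_samples):
        row = []
        for block, merit in zip(blocks, merits):
            row.extend(value / merit for value in block[i])
        fused.append(row)
    return fused

# Hypothetical blocks: 2 samples, 2 amino acids and 1 lipid.
amino_acids = [[1.0, 2.0], [3.0, 4.0]]
lipids = [[10.0], [20.0]]
fused = fuse_blocks([amino_acids, lipids], merits=[0.5, 2.0])
```

The fused matrix can then be mean-centred and passed to an ordinary PCA routine.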

• Method needs good estimation of error variance by

– Repeats

– QC samples

Realistic simulations using GCMS and LCMS data

• Error variance estimated from duplicates

• True error variance

• Estimating variance from duplicates is problematic.

• Use Mix of QC samples and repeats.








