Data Acquisition and Analysis in Mass Spectrometry Based Metabolomics
Metabolomics Data Analysis
SeqAhead, Barcelona February 2013
Johan A. Westerhuis
Swammerdam Institute for Life Sciences, University of Amsterdam
Business Mathematics and Information,
North-West University, Potchefstroom, South Africa
Metabolomics pipeline: Issues for biostatistics
• Biological question
• Experimental design: power analysis, treatment design
• Data acquisition: measurement design, QC strategy
• Data pre-processing: normalisation, quantification
• Statistical data analysis: explorative, predictive, hypothetical biomarkers
• Metabolite identification: spectral matching, de novo identification
• Biological interpretation: network inference, MSEA, pathway analysis
Data Analysis special issue Metabolomics
• Data preprocessing methods (make samples more comparable)
• How to treat non-detects
• Variable importance in multivariate models
• Metabolic network analysis
• Data fusion methods
• Individual responses
• Between metabolite ratios
Guest Editors: Jeroen J. Jansen, Johan A. Westerhuis
Multivariate metabolomics data
Technical correlations, biological correlations
TARGETED ANALYSIS
hipp  fum  urea  allant  TMAO  citrat
3     1    8     7       13    4
5     2    12    6       15    2
NONTARGETED PROFILING
1  67  45  6  3   31  10  44  32  10
3  24  12  4  33  23  0   0   99  76
Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters, structure / outliers in metabolites and in samples
• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers
• Biological Interpretation
– Metabolite set enrichment, pathway analysis
– Metabolic network inference
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion
Metabolomics Data preprocessing
• Optimize biological content of data
• Correct for incorrect sampling, sample workup issues, batch effects
• What is the noise level in the data? Variance stabilization.
• High peaks more important than low peaks?
• Multivariate methods love large values!
Generalized log transform
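The generalized log transform named above can be sketched as follows. This is a minimal illustration, assuming the common form glog(x) = ln((x + sqrt(x^2 + lambda)) / 2); the parameter name `lam` and the example peak values are assumptions, not taken from the slides.

```python
import math

def glog(x, lam=1.0):
    """Generalized log: ln((x + sqrt(x^2 + lam)) / 2).

    Behaves like ln(x) for large x, but stays finite at zero and for
    small negative values, which damps the influence of very high peaks.
    `lam` is a tuning parameter (assumed here, not given in the slides).
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical peak intensities spanning several orders of magnitude.
peaks = [0.0, 1.0, 10.0, 1000.0, 100000.0]
transformed = [glog(p) for p in peaks]
```

For large intensities the result approaches the ordinary natural log, while low intensities near the noise level are compressed instead of exploding towards minus infinity.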
Metabolic changes during E. coli culture growth using k-means clustering.
Szymanski, Jedrzej et al., PLoS ONE (2009), vol. 4, issue 10
(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of respective sampling time points are marked in the curve. Time point 0 minutes marks the application of the respective stress condition.
(B) Relative changes of metabolite pools normalized to time point 1. Fold change is presented on a log10 scale. To reveal the main trends of metabolic changes, 10 k-means clusters are color coded.
[Figure axes: metabolites versus time.]
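The clustering of fold-change profiles described in the caption can be sketched in plain Python. This is an illustrative toy, not the procedure of the cited paper: the metabolite pool values are invented, k is set to 2 instead of 10, and the deterministic farthest-point initialisation is a simplification chosen to make the example reproducible.

```python
import math

def log10_fold_change(profile):
    """Express a time profile relative to its first time point on a log10 scale."""
    return [math.log10(v / profile[0]) for v in profile]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(rows, k):
    """Deterministic start: first row, then repeatedly the row farthest from all centres."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(rows, k, n_iter=100):
    """Plain k-means: assign each row to its nearest centre, then update the centres."""
    centers = init_centers(rows, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centre if a cluster runs empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical metabolite pools (rows) measured at four time points (columns).
pools = [
    [1.0, 2.0, 4.0, 8.0],
    [2.0, 5.0, 9.0, 20.0],
    [3.0, 6.0, 13.0, 25.0],
    [8.0, 4.0, 2.0, 1.0],
    [10.0, 5.0, 2.5, 1.2],
    [6.0, 3.0, 1.4, 0.8],
]
profiles = [log10_fold_change(p) for p in pools]
labels, centers = kmeans(profiles, k=2)
```

In this toy data the accumulating and the depleting metabolites end up in separate clusters, which is the kind of main trend the color coding in the figure is meant to reveal.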
Self Organising Map of Metabolites in serum
• 1H NMR spectra of 613 patients with type I diabetes and a diverse spread of complications
• Nonlinear mapping method for large number of samples
• Relate position on the map to diagnostic responses
• Can be made supervised
1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death, VP Mäkinen et al., Molecular Systems Biology 4:167, 2008
Supervised (differentially expressed): discriminate two or more groups to make a predictive model and to find biomarkers.
Supervised Metabolomics Data analysis: Case – Control (PLSDA)
Class vector Y = (0, 0, 0, 1, 1, 1)
• Is there really a difference between the groups?
– Statistical validation issues
• Which are the most important peaks for discrimination?
– Variable importance
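One common way to approach the first question is a label permutation test; the slide itself does not name a method, so this is only a sketch. To keep it short, the test statistic is the squared distance between class centroids rather than a PLSDA performance measure, and the data, the function names and the seed are all made up for illustration.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def group_separation(X, y):
    """Squared Euclidean distance between the two class centroids."""
    c0 = centroid([x for x, lab in zip(X, y) if lab == 0])
    c1 = centroid([x for x, lab in zip(X, y) if lab == 1])
    return sum((a - b) ** 2 for a, b in zip(c0, c1))

def permutation_p_value(X, y, n_perm=999, seed=1):
    """Fraction of label shufflings that separate the groups at least as well as the real labels."""
    rng = random.Random(seed)
    observed = group_separation(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if group_separation(X, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 6 controls and 6 cases, 2 peaks each; only peak 1 differs.
X = [
    [1.0, 5.0], [1.2, 5.1], [0.9, 4.8], [1.1, 5.3], [1.0, 4.9], [0.8, 5.2],
    [3.0, 5.0], [3.2, 4.9], [2.9, 5.2], [3.1, 5.1], [2.8, 4.7], [3.3, 5.0],
]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p_value = permutation_p_value(X, y)
```

The same scheme carries over to a PLSDA model by replacing `group_separation` with a cross-validated classification statistic. With only three samples per class, as in the Y vector above, there are too few distinct relabelings for a small p-value, which is one of the validation issues.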
[Figure: score plot of PC1 versus PC2 with men and women marked, and PLS regression coefficients (bPLS) versus chemical shift (ppm).]
• Explain the Psyhogios example using examples from the paper and from MetaboAnalyst
NMR spectra of urine samples
Proton NMR spectra of the urine samples were obtained on a 500 MHz 1H NMR machine.
[Figure: nonsupervised versus supervised analysis.]
Experimental Design Example
Experiment: Rats are given bromobenzene (BB), which affects the liver
Measurements: NMR spectroscopy of urine
Experimental Design:
• Time: 6, 24 and 48 hours
• Groups: 3 doses of BB, vehicle group, control group
• Animals: 3 rats per dose per time point
[Figure: NMR spectrum versus chemical shift (ppm, 0 to 10) with labelled peaks at 2.055, 2.075, 2.7175, 2.93, 3.0275, 3.0475, 3.285, 3.675, 3.7525 and 5.38 ppm.]
Different contributions
[Figure: metabolite concentration versus time in four panels: Time, Dose, Animal and the resulting Trajectories.]
Experimental Design
ANOVA decomposition of each variable
$$x_{hki_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hki_{hk}}$$
MATRICES:
$$X = \mathbf{1}m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$
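ANOVA and PCA: ASCA
$$X = \mathbf{1}m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$
$$X = \mathbf{1}m^{T} + T_{\alpha}P_{\alpha}^{T} + T_{\alpha\beta}P_{\alpha\beta}^{T} + T_{\alpha\beta\gamma}P_{\alpha\beta\gamma}^{T} + E$$
E: parts of the data not explained by the component models
[Diagram: X split into scores T and loadings P for the α, αβ and αβγ submodels.]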
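The ANOVA step that precedes the PCA of each submodel can be sketched for a single variable. The sketch assumes that k indexes time, h dose and i the animal within a dose-time cell, and that the design is balanced (the same number of animals in every cell); the data values and the names `anova_decompose` and `data` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def anova_decompose(data):
    """Split one variable into mu, alpha_k, (alpha beta)_hk and (alpha beta gamma)_hki.

    `data` maps (dose, time) to the list of values of the animals in that cell.
    """
    times = sorted({k for (_, k) in data})
    mu = mean([x for cell in data.values() for x in cell])
    # Time effect: mean at each time point minus the overall mean.
    alpha = {
        k: mean([x for (_, kk), cell in data.items() if kk == k for x in cell]) - mu
        for k in times
    }
    # Dose-by-time effect: cell mean minus overall mean and time effect.
    alpha_beta = {hk: mean(cell) - mu - alpha[hk[1]] for hk, cell in data.items()}
    # Individual effect: each animal minus its own cell mean.
    alpha_beta_gamma = {hk: [x - mean(cell) for x in cell] for hk, cell in data.items()}
    return mu, alpha, alpha_beta, alpha_beta_gamma

# Hypothetical peak intensities: 2 doses x 2 time points x 2 animals.
data = {
    ("low", 6): [1.0, 1.2],
    ("low", 24): [2.0, 2.2],
    ("high", 6): [1.5, 1.7],
    ("high", 24): [4.0, 4.4],
}
mu, alpha, alpha_beta, alpha_beta_gamma = anova_decompose(data)
```

Applying this to every column of X gives the effect matrices of the decomposition; ASCA then fits a separate PCA model to each of them.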
Results
[Figure: αβ-scores versus time (6, 24 and 48 hours) for the control, vehicle, low, medium and high dose groups, shown with the submodels Xα, Xαβ and Xαβγ; the figure carries the annotation 40 %.]
Results biomarkers
[Figure: important peaks of the α, αβ and αβγ submodels along the chemical shift axis (ppm, 0 to 10). Labelled peaks: 2.055, 2.075, 2.5425, 2.5825, 2.6975, 2.735, 2.91, 2.93, 3.0275, 3.0475, 3.2625, 3.285, 3.675, 3.73, 3.7525, 3.8875, 3.9675 and 5.38 ppm.]
Differences between submodels
• Interesting for Biology
• Interesting for Statistics / Diagnostics
• Unique to the α submodel
Supervised – Method comparison
Gaucher (NONTARGETED)
• SELDI measurements of serum samples of 20 Gaucher patients and 20 healthy controls. Gaucher is a genetic disease in which a fatty substance (lipid) accumulates in cells and certain organs.
Spiked
• Human urine and porcine cerebrospinal fluid samples spiked with a range of peptides.
• Variation in number of samples, and in within- and between-group variation.
Feature selection methods: RESULTS
• Complex nontargeted Gaucher profiling data with highly variable background and varying difference between case and control: multivariate methods perform best.
• Spiked LCMS targeted data with less variation in effect size: univariate and semi-univariate methods are best in selecting biomarkers.
Biomarkers: A: Univariate, B: Multivariate, C: Change in group correlation
Between metabolite ratios (BMR) in a green tea intervention study with 186 human subjects with abdominal obesity.
Validation shows significant changes in BMR between placebo and green tea treatment; the most important triacylglycerols are TG28-29 and TG41-42.
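Computing between metabolite ratios for one sample can be sketched as below. Taking the natural log of each ratio is an assumption made here for symmetry, and the metabolite names and concentrations are placeholders rather than data from the green tea study.

```python
import math
from itertools import combinations

def log_ratios(sample):
    """Return the log ratio for every unordered pair of metabolites in one sample.

    `sample` maps metabolite name to a positive concentration.
    """
    return {
        (a, b): math.log(sample[a] / sample[b])
        for a, b in combinations(sorted(sample), 2)
    }

# Hypothetical concentrations for three metabolites in one subject.
subject = {"m1": 2.0, "m2": 1.0, "m3": 4.0}
ratios = log_ratios(subject)
```

With p metabolites this yields p(p-1)/2 ratios per sample, so the number of candidate biomarkers grows quickly and validation becomes correspondingly more important.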
Plasma
Differences in blood metabolites due to aging
Aging biomarker metabolites in liver
Special topic: Metabolic networks
Biochemical Network vs Association Network
Margriet M.W.B. Hendriks , Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 20212
Figure 7 Marginal correlation network for a set of metabolites in tomato. Volatiles in red, derivatized metabolites in yellow. Solid lines represent positive correlations, dashed lines negative ones. Thickness of line corresponds to magnitude of ...
Metabolomics, 2005
Metabolic neighbors Do not participate in common reactions
High correlation due to e.g. chemical equilibrium, mass conservation,..
Data from Potato tubers
“a systematic relationship between observed correlation networks and the underlying biochemical pathways.” Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003
[Diagram: two networks over metabolites A to F, joined by "??".]
Metabolic Network Inference
Search for the link between metabolome data and underlying metabolic networks.
As an example: can we distinguish healthy from diseased networks?
[Diagram: networks over Glucose and metabolites A to G for the HEALTHY and DISEASE states.]
Goal: ?
Problems:
• Noise
• Missing metabolites
• Network topology
• Directions
• Huge amount of possible network structures
From data to network
2a. Relevance Networks
– Pearson Correlation (PC) (linear)
– Mutual Information (MI) (non-linear)
2b. Conditioned Networks
– Partial Pearson Correlation (PPC) (linear)
– Conditional Mutual Information (CMI) (non-linear)
[Diagrams: example networks over metabolites A to F for each approach.]
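The difference between a relevance score and a conditioned score can be sketched with the two linear measures listed above. The example uses the standard first-order partial correlation formula and simulated data for a hypothetical chain A to B to C; the seed, the sample size and the function names are assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_from_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z, from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_pearson(x, y, z):
    """Correlation of x and y after conditioning on a single third variable z."""
    return partial_from_r(pearson(x, y), pearson(x, z), pearson(y, z))

# Simulated chain A -> B -> C: A and C are related only through B.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
b = [v + rng.gauss(0, 1) for v in a]
c = [v + rng.gauss(0, 1) for v in b]

r_ac = pearson(a, c)                # relevance score: clearly non-zero
r_ac_given_b = partial_pearson(a, c, b)  # conditioned score: close to zero
```

A relevance network would draw an edge between A and C because their plain correlation is high, whereas conditioning on B removes most of that indirect association.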
[Figure: 1. Data collection; 2. Similarity score calculation for all possible pairwise interactions.]
Inference from static data
[Figure panels: A. Enzymatic Variability; B. Intrinsic Variability; C. Environmental Variability.]
ESTIMATION OF CORRELATION NETWORKS
Real pathway: 1. ASPP, 2. ASA, 3. HS, 4. HSP
[Figure: networks over ASPP, ASA, HS and HSP obtained with PC, MI, PPC1, PPCn and CMI1 under Vmax, intrinsic and environmental variability, shown next to the real pathway. Legend: 100%, > 90%, 10% … 90%, < 10%.]
PC: Pearson Correlation (linear measure)
MI: Entropy-based Mutual Information (non-linear measure)
PPC: Partial Pearson Correlation (linear conditioning measure)
CMI: Conditional Mutual Information (nonlinear conditioning measure)
Cakir, Metabolomics 2009
Metabolomics data fusion
• Account for between-block difference in quality of measurements to improve data fusion
• For example, multi-platform data fusion, with differences in quantification, (non)targeted, error structure
• How to quantify the quality of measurements with many metabolites, and many samples?
[Figure: amino acids block and lipids block combined into fused data.]
Error model for 1 metabolite
• Error models:
– RSD using 1 QC sample
– 2-component using study samples
• Good error description
– sufficient # samples
– large intensity range
[Figure: standard deviation versus mean intensity I, with the QC sample giving the RSD and the study samples giving the 2-component model.]
Figure of merit for data from 1 platform
• Median: F-50 = 0.1
• 90th percentile: F-90 = 0.35
[Figure: standard deviation versus intensity I for Var. 15, Var. 118, Var. 213 and Var. 365, and a histogram of the number of peaks with F-50 and F-90 marked.]
(Van Batenburg et al., Analytical Chemistry, 2011)
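Summarising the measurement quality of a platform by a median and a 90th percentile can be sketched as follows. The sketch assumes that the per-peak quantity is the RSD over repeated QC injections; the exact definition used by Van Batenburg et al. may differ, and the QC intensities and function names are hypothetical.

```python
import statistics

def rsd(values):
    """Relative standard deviation: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def figures_of_merit(qc_by_peak):
    """Return (F-50, F-90): median and 90th percentile of the per-peak RSD."""
    rsds = [rsd(v) for v in qc_by_peak.values()]
    return percentile(rsds, 50), percentile(rsds, 90)

# Hypothetical QC intensities: three repeated injections for each of four peaks.
qc_by_peak = {
    "peak_1": [9.0, 10.0, 11.0],
    "peak_2": [95.0, 100.0, 105.0],
    "peak_3": [40.0, 50.0, 60.0],
    "peak_4": [198.0, 200.0, 202.0],
}
f50, f90 = figures_of_merit(qc_by_peak)
```

Reporting both numbers shows the typical precision of a platform as well as how bad its worst tenth of peaks is, which a single average would hide.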
Two-step data fusion
• Step 1: Compute figures of merit for each platform
[Figure: standard deviation versus intensity for the GC/MS and LC/MS platforms; J1 = 82 peaks, J2 = 49 peaks.]
Two-step data fusion: MB-MLPCA
• Step 2: Multi-block PCA with weighting by figures of merit
[Figure: blocks X1 (amino acids) and X2 (lipids) combined through a fused error covariance built from the estimated error variances ŝ_j².]
• Method needs good estimation of error variance by
– Repeats
– QC samples
Realistic simulations using GCMS and LCMS data
• Error variance estimated from duplicates
• True error variance
• Estimating variance from duplicates is problematic.
• Use a mix of QC samples and repeats.
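The idea of weighting blocks by their measurement quality before fusing them can be sketched in a much simplified form. This is not MB-MLPCA: each variable is merely divided by an estimated error standard deviation before the blocks are concatenated, and the block contents, error values and function names are assumptions made for illustration.

```python
def weight_block(block, error_sd):
    """Divide column j of a block by its estimated error standard deviation s_j."""
    return [[v / s for v, s in zip(row, error_sd)] for row in block]

def fuse(blocks_with_sd):
    """Weight every block, then concatenate the blocks sample by sample."""
    weighted = [weight_block(block, sd) for block, sd in blocks_with_sd]
    return [[v for part in parts for v in part] for parts in zip(*weighted)]

# Hypothetical blocks: 2 samples, measured on two platforms.
amino_acids = [[2.0, 4.0], [4.0, 8.0]]   # block X1, two variables
lipids = [[3.0], [6.0]]                  # block X2, one variable
fused = fuse([
    (amino_acids, [2.0, 4.0]),           # noisier platform: larger error SDs
    (lipids, [1.0]),                     # more precise platform
])
```

After this scaling every variable is expressed in units of its own measurement error, so an ordinary PCA on the fused matrix no longer favours a block simply because it has larger or noisier values.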