
Characterization of metabolomic signatures in septic shock patients: a data mining approach

XXIX Cycle 2014 – 2017

POLITECNICO DI MILANO

DEPARTMENT OF ELECTRONICS, INFORMATION AND BIOENGINEERING

DOCTORAL PROGRAM IN BIOENGINEERING

Doctoral dissertation of: Alice Cambiaghi

Advisors: Dr. Manuela FERRARIO, Prof. Giuseppe BASELLI, Dr. Roberta PASTORELLI

Tutor: Prof. Manuela Teresa RAIMONDI

Supervisor of the PhD Program: Prof. Andrea ALIVERTI

To Davide

Abstract

Septic shock is one of the major complications in critically ill patients, with a mortality rate reaching 40%, a high risk of second-line treatments and long-term physical and cognitive impairments in survivors. Current treatments for septic shock mainly aim to restore homeostasis and prevent multiple organ failure by administering fluids and vasoactive agents to avoid prolonged hypotension. Despite significant improvements in clinical care, accurate diagnosis and risk stratification for septic shock patients remain a challenge, and clinicians are still far from having found the optimal therapy. In this context, information at the molecular or cellular level provided by omics analyses is of great importance for the development of new therapeutic targets and for following individual response to therapy. Thanks to recent technological advances in high-throughput omics analyses, this type of data is becoming increasingly accessible, giving rise to huge and heterogeneous datasets. In this framework, interest in metabolomics has increased, since metabolites represent the terminal downstream products of the genome and consist of the total complement of all low-molecular-weight molecules that cellular processes leave behind. Hence, metabolomics studies are very promising for modelling complex and multifactorial syndromes, such as septic shock, and may be a useful tool toward personalized medicine. In spite of this progress, the management of metabolomics data is still an open challenge. Data mining and machine learning approaches have recently been applied in this context, but several aspects have yet to be explored before reliable tools are available.

The objective of this thesis is the exploration of machine learning and data mining techniques for metabolomics data analysis and multilevel integration in two septic shock patient cohorts selected from the ALBIOS and ShockOmics clinical trials. We focused on a homogeneous and well-defined group of patients in the same condition (i.e. severe septic shock) and on a short temporal window (i.e. 48 hours or one week after diagnosis). The models obtained highlighted the role of lipids, alanine and plasmalogens. The identified pathways could be a further step in understanding the complex mechanisms, still under study, involved in the pathogenesis and progression of septic shock.

Sommario

Septic shock is one of the major complications that can arise in intensive care patients; it is characterized by a mortality rate approaching 40% and by severe long-term consequences involving both cognitive and physical deterioration. Currently, the treatment of septic shock aims to restore homeostasis and to prevent multiple organ failure through the intravenous administration of fluids and vasopressor drugs so as to normalize blood pressure. Despite the significant improvements of recent years, risk-based stratification of septic shock patients at the time of initial diagnosis remains difficult, and the optimal treatment is still far from being found. In this context, the information that omics data provide on the mechanisms involved at the molecular and cellular level is essential both for identifying new therapeutic targets and for continuously monitoring each patient's response to therapy. Thanks to recent technological advances in omics analyses, this type of data has become increasingly accessible, giving rise to very large and heterogeneous databases. In particular, interest in metabolomics is growing, because metabolites are the end products of cellular metabolism and can therefore provide valuable indications of the molecular processes under way at a given instant. Metabolomics studies are therefore particularly promising for modelling a complex syndrome such as septic shock and could be an excellent starting point for the development of models for personalized medicine. Nevertheless, despite this progress, the processing and analysis of omics data remain a challenge. Data mining and machine learning techniques have recently been used in this field, but much remains to be explored before reliable methods can be obtained.

In light of the above, the objective of this thesis is the study of machine learning techniques for the multiscale integration of omics data in two cohorts of septic shock patients selected from the ALBIOS and ShockOmics clinical studies. We focused on homogeneous groups of patients in the same condition (severe septic shock) and on a limited time window (48 hours or one week after diagnosis). The models obtained highlight the role of lipids, alanine and plasmalogens. The biological processes identified could be a step forward in understanding the complex mechanisms, still under study, involved in the pathogenesis and progression of septic shock.

Table of Contents

List of abbreviations
Summary
1 INTRODUCTION
1.1 SHOCK
1.1.1 Definition, description and causes
1.1.2 Pathophysiology
1.2 SEPTIC SHOCK
1.2.1 Definition and incidence
1.2.2 Pathophysiology
1.2.3 Treatment of septic shock
1.2.4 The SOFA score
1.3 OMICS DATA
1.3.1 Metabolomics
1.3.2 Proteomics
1.4 MOTIVATIONS AND OBJECTIVES OF THE STUDY
1.4.1 Thesis outline
2 DATA MINING METHODS FOR OMICS DATA ANALYSIS AND INTEGRATION
2.1 THE MULTIPLE TESTING PROBLEM
2.2 LINEAR AND LOGISTIC REGRESSION MODELS
2.2.1 Shrinkage methods
2.3 FEATURE REDUCTION
2.3.1 The minimum-redundancy maximum-relevance (mRMR) algorithm
2.4 LINEAR METHODS FOR CLASSIFICATION
2.4.1 Linear Discriminant Analysis (LDA)
2.4.2 Partial Least Squares Discriminant Analysis (PLS-DA)
2.5 PERFORMANCE EVALUATION
2.6 PROBABILISTIC GRAPHICAL MODELS
2.7 FINAL CONSIDERATIONS
3 MORTALITY PREDICTION FOR SEVERE SEPTIC SHOCK PATIENTS: A TARGETED METABOLOMICS STUDY ON ALBIOS DATABASE
3.1 INTRODUCTION
3.2 MATERIAL AND METHODS
3.2.1 Study design, patients and clinical data
3.2.2 Univariate analyses for metabolomics data
3.2.3 Multivariate analysis
3.3 RESULTS
3.3.1 Clinical characteristics of the study population
3.3.2 Time-course of plasma metabolites and association with mortality
3.3.3 Association between metabolic patterns and mortality
3.3.4 Integrated clinical and metabolomics determinants of mortality
3.4 DISCUSSION
3.5 REMARKS
4 INTEGRATION OF METABOLOMICS AND PROTEOMICS: AN ANCILLARY STUDY ON ALBIOS DATABASE
4.1 INTRODUCTION
4.2.2 Proteomics data analyses
4.2.3 Statistical analyses
4.3 RESULTS
4.3.2 Changes in protein expressions between groups
4.3.3 Time trend variation of proteins and metabolites
4.4 EXPLORATIVE ANALYSES
4.5 DISCUSSION
4.6 REMARKS
5 CHARACTERIZATION OF A METABOLOMIC PROFILE ASSOCIATED WITH RESPONSIVENESS TO THERAPY IN THE ACUTE PHASE OF SEPTIC SHOCK
5.1 INTRODUCTION
5.2.2 Statistical analysis
5.2.3 Data from targeted metabolomics analysis
5.2.4 Multivariate analyses
5.3 RESULTS
5.3.2 Metabolic fingerprinting by untargeted metabolomics
5.3.3 Metabolic profiling by targeted metabolomics
5.3.4 Regression analysis for targeted metabolomics data
5.3.5 Regression models for targeted and untargeted metabolomics data
5.3.6 Discriminant analysis
5.4 EXPLORATIVE ANALYSES
5.5 DISCUSSION
5.6 REMARKS
6 DISCUSSION AND CONCLUSIONS
6.1 MAIN FINDINGS
6.2 LIMITS AND CLINICAL IMPACT OF THE STUDY
6.3 FUTURE DEVELOPMENTS
A METABOLOMICS ANALYSES
A.1 UNTARGETED METABOLOMICS BY FLOW INJECTION-TOF-MS
A.1.1 Samples preparation
A.1.2 Flow Injection-TOF MS/MS
A.1.3 MS Data Processing
A.1.4 Metabolite identification
A.2 TARGETED METABOLOMICS
B PROTEOMICS ANALYSES BY iTRAQ QUANTITATION
B.1 STUDY DESIGN
B.2 SAMPLE PREPARATION
B.2.1 Human Plasma depletion
B.2.2 In solution sample digestion
B.2.3 Peptide labeling
B.2.4 Sample clean-up and fractionation
B.3 LC-MS/MS ANALYSES
B.4 DATABASE SEARCH
B.5 DATA ANALYSIS
C Comparison of metabolomics profile of cardiogenic and septic shock patients
C.1 AIM OF THE ANALYSES
C.2 PRELIMINARY RESULTS
C.3 REMARKS
D LIST OF PUBLICATIONS
Bibliography

List of abbreviations

AA          Amino acid
AIC         Akaike Information Criterion
ATP         Adenosine Triphosphate
AUC         Area Under the Curve
BIC         Bayesian Information Criterion
BN          Bayesian Network
CE          Capillary Electrophoresis
CI          Conditional Independence
CS          Cardiogenic Shock
CV          Cross Validation
C1QA        Complement C1q subcomponent subunit A
DAG         Directed Acyclic Graph
DAMP        Damage-Associated Molecular Pattern
DIC         Disseminated Intravascular Coagulopathy
EGDT        Early Goal Directed Therapy
ETC         Electron Transport Chain
FA          Fatty Acid
FDR         False Discovery Rate
FIA-TOF-MS  Flow Injection-Time-of-Flight Mass Spectrometry
FP          False Positive
GC          Gas Chromatography
HMDB        Human Metabolome Database
HPLC        High Performance Liquid Chromatography
ICU         Intensive Care Unit
IDO         Indoleamine 2,3-dioxygenase
iTRAQ       Isobaric Tag for Relative and Absolute Quantitation
LC          Liquid Chromatography
LCPUFA      Long Chain Polyunsaturated Fatty Acid
LDA         Linear Discriminant Analysis
lysoPC      Lysophosphatidylcholines
MI          Mutual Information
MN          Markov Network
MODS        Multiple Organ Dysfunction Syndrome
MOF         Multiple Organ Failure
mRMR        minimal-Redundancy Maximal-Relevance
MS          Mass Spectrometry
MSE         Mean Square Error
NMR         Nuclear Magnetic Resonance
NR          Non-responder
NS          Non-survivor
OLS         Ordinary Least Squares
OOB         Out-Of-Bag
PAMP        Pathogen-Associated Molecular Pattern
PC          Phosphatidylcholines
PLA2        Plasma Phospholipase A2
PLS-DA      Partial Least Squares Discriminant Analysis
PRR         Pattern Recognition Receptor
Q-TOF       Quadrupole time-of-flight
R           Responder
RF          Random Forests
ROC         Receiver Operating Characteristic
ROS         Reactive Oxygen Species
S           Survivors
SM          Sphingomyelins
SOFA        Sequential Organ Failure Assessment
SS          Septic Shock
SVM         Support Vector Machines
TN          True Negative
TNF         Tumor Necrosis Factor
TP          True Positive
VIP         Variable Importance in Projection

Summary

According to reports in the United States and in Europe, shock affects about one third of patients in the Intensive Care Unit (ICU), for a total of more than 1 million victims a year. Septic shock is a very common kind of shock and remains a major complication in critically ill patients, owing to its high lethality (40%), high risk of second-line treatments and long-term physical and cognitive impairments in survivors, with a 5-year mortality rate of 75%¹. Septic shock is defined as a complication of sepsis characterized by life-threatening organ dysfunction caused by a dysregulated host response to infection, which involves circulatory, cellular and metabolic abnormalities².

Although the pathophysiology of septic shock is not precisely understood, it has become

evident that it involves complex interactions between the pathogen and the host’s immune system.

More precisely, septic shock is characterized by an imbalance in the initial host response to infection

due to the fact that the physiological pro-inflammatory response is not adequately compensated by

anti-inflammatory mechanisms. This leads to an overwhelming and uncontrolled inflammation with

several harmful effects including: damage to cellular proteins, lipids and DNA by excessive

production of reactive oxygen species (ROS), compromised mitochondrial functionality, and

impairment of the coagulation cascade with subsequent formation of microvascular thrombi and

fibrin deposition. This latter event causes microcirculatory alterations which ultimately result in

poor tissue perfusion. Without adequate oxygen delivery, tissues deteriorate and, consequently,

organs begin to fail. This condition is known as multiple organ failure (MOF), i.e. organs not directly

injured by infection become dysfunctional due to systemic disorders involving immunoregulation

and endothelial dysfunction. When MOF occurs, the damage to tissues is already so extensive that

the patient is destined to die, even with an adequate medical intervention³,⁴.

1 Iwashyna, T. J., Ely, E. W., Smith, D. M. & Langa, K. M. Long-term cognitive impairment and functional disability among survivors of severe sepsis. JAMA 304, 1787–1794 (2010).
2 Singer, M. et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 315, 801–810 (2016).
3 Angus, D. C. & van der Poll, T. Severe sepsis and septic shock. N Engl J Med 369, 840–851 (2013).
4 De Backer, D., Orbegozo Cortes, D., Donadello, K. & Vincent, J.-L. Pathophysiology of microcirculatory dysfunction and the pathogenesis of septic shock. Virulence 5, 73–79 (2014).

Current treatments for septic shock mainly aim to restore homeostasis and to prevent MOF. Since the transition to serious illness occurs within a few critical hours, it has been speculated that early recognition and treatment administration could provide maximal benefit in terms of outcome. For this reason, early goal directed therapy (EGDT) has been attempted for the treatment of shock patients in the ICU. EGDT consists of an early hemodynamic assessment, based on physical findings and vital signs, to detect persistent global tissue hypoxia so as to rapidly balance oxygen delivery with oxygen demand.

Despite significant improvements in clinical care, accurate diagnosis and risk stratification for septic shock patients remain a challenge, and clinicians are still far from having found the optimal therapy. Currently, the choice of treatment is based only upon the traditional concept of sepsis progression and the corresponding clinical signs (i.e. organ hypoperfusion); thus it is not tailored to the individual. Furthermore, given the highly variable and non-specific symptoms and clinical presentations of this pathology, an early assessment of sepsis severity is not trivial and patients' response to therapy is difficult to predict. The complex pathophysiology of sepsis suggests that a single-biomarker approach cannot adequately describe this syndrome. In fact, traditional biomarker strategies (e.g. measurement of plasma concentrations of C-reactive protein, procalcitonin, cytokines, etc.) have not yielded a definitive biomarker panel, since they cannot discriminate individual patient responses and outcomes. Therefore, a comprehensive and integrated analysis of molecular and clinical measurements is needed to plan an early and appropriate therapeutic intervention.

In this context, information at the molecular or cellular level provided by omics analyses (i.e. genomics, transcriptomics, proteomics and metabolomics) may be a suitable means to follow responsiveness to therapy, to establish new therapeutic targets, and to enable the identification of patients amenable to tailored therapies. Recently, interest in metabolomics has been increasing, as it may provide a more sensitive readout of individual response phenotypes. Metabolomics consists of the analysis and quantification of thousands of metabolites, i.e. small molecular compounds which constitute the end products of cellular metabolism and can thus be considered the chemical fingerprint of an organism at a precise time point. Metabolomics approaches have become a powerful tool for revealing molecular pathways and for identifying and quantifying differentially expressed metabolites, independently of the multiple trigger factors causing the disease. This aspect is very promising for complex and multifactorial syndromes, such as septic shock, and makes metabolomics analyses a suitable starting point toward personalized medicine.


In this thesis, we focused on a homogeneous and well-defined group of patients in the same condition (i.e. severe septic shock) and on a short temporal window (i.e. one week or 48 hours after diagnosis). The first cohort consists of a subset of 20 patients from the ALBIOS database (Albumin Italian Outcome Sepsis study, NCT00707122)⁵, a multicenter clinical trial which enrolled patients with severe sepsis or septic shock from 100 ICUs in Italy. The second cohort consists of 21 septic shock patients from the ShockOmics dataset (NCT02141607)⁶.

Our objective was to provide a thorough description of the putative biological pathways which characterize these patient cohorts, in order to suggest possible biomarkers to be validated in further investigations. As hundreds of metabolites can be measured and each of them can be associated with more than one pathway, it is important not only to find changes or differences in metabolite concentrations, but also to find possible associations among them, which can point to a prevailing pathway.

5 Caironi, P. et al. Albumin replacement in patients with severe sepsis or septic shock. N Engl J Med 370, 1412–1421 (2014).
6 Aletti, F. et al. ShockOmics: multiscale approach to the identification of molecular biomarkers in acute heart failure induced by shock. Scand. J. Trauma. Resusc. Emerg. Med. 24, 9 (2016).

Thanks to recent technological advances in high-throughput omics analyses, omics data are becoming increasingly accessible, giving rise to huge and heterogeneous datasets which require specialized mathematical, statistical and bioinformatics tools that are not yet fully available. Some attempts to analyze and integrate omics data have been made, and some general statistical tools, such as unsupervised multivariate data analysis, correlation analysis or principal components analysis, have been used and are currently implemented in different software packages. However, given the complexity of this kind of dataset, traditional statistical tests cannot be used alone for a robust analysis. In fact, the main objective of these methods is data exploration, considering each feature as independent of the other variables in the dataset. Consequently, they do not allow for a general and valid categorization of the selected variables or, as in our case, for biological and functional interpretation. As an attempt to circumvent this problem, data mining and alternative feature reduction methods are proposed in this dissertation as a novel strategy for selecting and prioritizing variables. In fact, by considering each feature in relation to all the others, data mining approaches make it possible to extract previously unsuspected information from omics datasets; they can therefore be used to develop classification models and to elucidate possible associations among species, which could reveal the metabolic pathways involved in the condition under study⁷.

As already outlined, omics datasets are characterized by a large number of highly correlated features (hundreds or thousands) compared to the number of observations. For this reason, it is necessary to develop suitable strategies by combining different data mining techniques, designed ad hoc for the specific scientific question and for the kind of data considered. As detailed in Chapter 2, the approach proposed in this dissertation combines feature reduction techniques, linear and logistic regression models and linear methods for classification.

Briefly, to perform feature reduction we adopted the minimal-redundancy-maximal-relevance (mRMR) method proposed by Peng et al.⁸ This algorithm sorts the features according to their relevance to the outcome (maximum relevance criterion) and to their redundancy with respect to the other variables (minimum redundancy criterion). The ranking is based on the mutual information between the outcome and each feature and on the mutual information between each pair of features. The selected, reduced set of features was then used to build logistic regression models with the elastic net regularization approach. The elastic net performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. This shrinkage regression method is effective when several variables are highly correlated, since it performs continuous variable selection, causing some of the regression coefficients to be exactly zero and thus eliminating redundant features. Moreover, the subset of variables corresponding to non-zero coefficients can be considered as the ones mainly associated with the outcome. Linear methods for classification, i.e. Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA), were also applied to our datasets.
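The two steps described above can be illustrated with a minimal Python sketch: a greedy mRMR ranking on discretized features, followed by the soft-thresholding update that allows the elastic net to set coefficients exactly to zero. This is not the implementation used in this work. The function names (mutual_information, mrmr_rank, elastic_net_update), the toy binary data, the "difference" form of the mRMR score and the squared-error coordinate update for standardized predictors (rather than the logistic case) are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_rank(features, outcome, k):
    """Greedy mRMR ranking: relevance minus mean redundancy (assumed form).

    features: dict mapping a feature name to its discretized values.
    Returns the names of the k selected features, in selection order.
    """
    relevance = {f: mutual_information(v, outcome) for f, v in features.items()}
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def elastic_net_update(z, lam, alpha):
    """One coordinate update for a standardized predictor (squared-error case).

    z is the unpenalized coefficient estimate, lam the penalty strength and
    alpha the L1/L2 mixing weight (alpha = 1 is the lasso, alpha = 0 is ridge).
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical toy data: feat_a_noisy is a near-duplicate of feat_a.
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "feat_a": [0, 0, 0, 0, 1, 1, 1, 0],
    "feat_a_noisy": [0, 0, 0, 1, 1, 1, 1, 0],
    "feat_b": [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(features, outcome, 3)
print(ranking)
print(elastic_net_update(0.05, lam=0.2, alpha=0.5))
```

In this toy example feat_a_noisy and feat_b carry the same mutual information with the outcome, but feat_a_noisy is ranked last because it is largely redundant with the already selected feat_a. The small coefficient passed to elastic_net_update falls below the L1 threshold and is set to zero, which is the mechanism behind the variable selection described above.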

Additionally, explorative analyses by probabilistic graphical models have been performed

with the aim to highlight conditional dependences among features. Probabilistic graphical models

aim to capture the underlying probabilistic relations between the domain variables and to express

them via a graph structure, easy to interpret. Specifically, the probabilistic relations are represented

as a network, where features constitute the nodes and the edges connecting them represents a

conditional dependence of the child node on the parent node. Thus, the absence of an edge

7 Baumgartner, C., Lewis, G. D., Netzer, M., Pfeifer, B. & Gerszten, R. E. A new data mining approach for profiling and

categorizing kinetic patterns of metabolic biomarkers after myocardial injury. Bioinformatics 26, 1745–1751 (2010). 8 Peng, H., Long, F. & Ding, C. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-

Relevance and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).

Summary

connecting two nodes implies that the two corresponding variables are conditionally independent,

given the other variables⁹. There are two main families of graphical representations: Bayesian

Networks, which use directed graphs (i.e. each edge has a source and a

target), and Markov Networks (MN), which use undirected graphs. For our analyses we applied

undirected MNs, which were built using a two-step approach. Firstly, the maximum likelihood

network was found using the algorithm of Chow and Liu¹⁰; then a forward search was performed on

each triangulated graph of the network until no more add-eligible edges were found.
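The first of the two steps above can be sketched as follows: the Chow and Liu algorithm amounts to computing the mutual information between every pair of variables and keeping the maximum-weight spanning tree. The example below is a hedged illustration in plain Python; the variable names `m1`, `m2`, `m3`, the toy values and the assumption that the features have already been discretized are ours, and the subsequent forward search that adds further edges is not covered.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def chow_liu_tree(data):
    """Return the edges of the maximum mutual-information spanning tree.

    `data` maps each variable name to its list of discrete observations.
    """
    names = sorted(data)
    # Candidate edges, strongest dependence first.
    candidates = sorted(
        ((mutual_information(data[a], data[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    edges = []
    for _, a, b in candidates:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding the edge does not create a cycle
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy discretized features (hypothetical values).
data = {
    "m1": [0, 0, 0, 0, 1, 1, 1, 1],
    "m2": [0, 0, 0, 1, 1, 1, 1, 1],
    "m3": [0, 0, 1, 1, 1, 1, 1, 1],
}
tree_edges = chow_liu_tree(data)
print(tree_edges)
```

Here `m1` and `m3` are linked only through `m2`, so the weakest pairwise dependence (`m1`, `m3`) is left out of the tree.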

The first study, presented in Chapter 3, is an explorative analysis aiming at providing absolute

quantitative information on changes in plasma metabolite levels measured one day and one week

after the development of severe septic shock, so as to relate these changes to mortality. The multicenter

ALBIOS clinical trial enrolled patients with severe sepsis or septic shock from 100 ICUs in Italy. Among

the 1818 patients included in this trial, only 20 were suitable for our study according to the

following inclusion criteria: presence of septic shock, total SOFA score > 8, serum lactate > 4 mmol/L,

and availability of plasma samples at day 1 (acute state, D1) and day 7 (steady state, D7) after

diagnosis of septic shock. These criteria were chosen to obtain a subset of patients homogeneous with

those of the ShockOmics clinical trial. We examined the plasma metabolome and clinical features of

these patients both at D1 and at D7. Patients were classified into two groups according to their

survival status 28 days after study enrollment: survivors (11 patients, S) and non-survivors (9

patients, NS). The two time points were chosen to verify the hypothesis that the metabolic changes

over time reflect not only initial clinical characteristics but also the progression of the disease

and the long-term survival. We applied a targeted mass spectrometry-based quantitative

metabolomics approach using the Biocrates platform coupled to a Triple-Quad 5500 LC-MS/MS

system. The association between metabolic patterns and mortality was assessed by univariate and

multivariate analyses, adjusted for clinically relevant variables. The outcome (S = 0, NS = 1) was

considered as the output of the final model. The performance was evaluated by means of the accuracy,

i.e. the proportion of correct classifications among the total number of cases examined. Overall, our

results showed that the metabolite species mainly involved are kynurenine and

lysophosphatidylcholines (lysoPC), alterations of which have already been reported in septic

shock patients.
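The accuracy used to evaluate the models can be written out explicitly. The snippet below is a small sketch using the outcome coding S = 0 and NS = 1; the function name and the label vectors are made up for illustration and are not taken from the study data.

```python
def accuracy(y_true, y_pred):
    """Proportion of cases whose predicted class equals the observed class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 0 = survivor (S), 1 = non-survivor (NS).
observed = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
print(accuracy(observed, predicted))  # 3 correct out of 5 -> 0.6
```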

9 Nielsen, T. D. & Jensen, F. V. Bayesian Networks and Decision Graphs. Springer Science & Business Media (2009).

10 Chow, C. K. & Liu, C. N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14, 462–467 (1968).

In light of these findings, we speculated that lipid homeostasis and tryptophan catabolism

might influence mortality in septic shock. Therefore, to acquire a more complete view of the

pathways involved in this complex syndrome, we further investigated the changes that occurred in

metabolite concentrations. Chapter 4 illustrates these analyses. The objective of this study was to

better characterize non-survivors (NS) according to the variations that occurred in

metabolite concentrations from D1 to D7, expressed as the ratio D7/D1, and to integrate this

information with the variations in protein levels and clinical data.

Protein signals, expressed as peak intensities, were measured by a multi-iTRAQ experiment

and the ratio D7/D1 was computed as for the metabolomics data. Three different classification

models were built: one for metabolites only, one for metabolites and proteins and one to integrate

metabolomics and proteomics data with clinical parameters. Firstly, feature reduction by the mRMR

algorithm was performed in order to avoid multicollinearity. We considered the first 10, 20 and 30

ranked metabolites to build three different classification models. To find the most reliable set of

features, we fitted an elastic net logistic model (logit link) to a training set 50 times

and selected the coefficients of the model with the minimal deviance.

We also applied another strategy: we used the shrinkage parameter λ corresponding to the model

with the minimal deviance to fit another elastic net model on a testing set and to obtain the

coefficients of the logistic regression. LDA and PLS-DA were also implemented. More precisely, LDA

was performed on the first 10 ranked metabolites and the coefficients for the linear boundary

between the first and second classes were retrieved. PLS-DA was performed both on the first 10

and 20 ranked metabolites, considering 3 PLS components. The results confirmed that

plasma levels of lipid species are altered early in non-survivors, as previously outlined. As

for proteins, the most important differences between the two groups are related to proteins

belonging to inflammation and to the coagulation cascade, which are two of the most important

pathways involved in septic shock progression. Finally, Markov Networks were built by combining

metabolomics and proteomics data as already done for the classification models. From a qualitative

analysis, it appears that features have different dependences in the two groups, possibly indicating

differences in the functioning of the molecular mechanisms involved. Overall, the results of this

study confirm the feasibility of our data mining approach for the analyses of metabolomic data and

for their integration with proteomics. Compared with our previous analyses on metabolomic data only,

the integration with proteomics seems to indicate the importance of the interaction between

inflammation, coagulation and the complement system in sepsis, which is in line with recent

findings.
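The mRMR ranking used for feature reduction in this study can be illustrated with a greedy selection that, at each step, picks the feature with the highest relevance to the outcome minus its mean redundancy with the features already chosen. The following plain-Python sketch uses the difference form of the criterion on discretized toy data; the feature names `f_a`, `f_b`, `f_c`, the values and the discretization are hypothetical and are not the implementation used here.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def mrmr_rank(features, target, k):
    """Greedy mRMR ranking: relevance minus mean redundancy with selected features."""
    relevance = {
        name: mutual_information(values, target)
        for name, values in features.items()
    }
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[chosen])
                for chosen in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy discretized data (hypothetical): f_b duplicates f_a, f_c is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_c": [1, 0, 0, 0, 1, 1, 1, 0],
}
ranking = mrmr_rank(features, target, k=3)
print(ranking)  # ['f_a', 'f_c', 'f_b']
```

Although `f_b` is as relevant as `f_a`, it is ranked last because it carries no additional information, which is the behaviour that motivates mRMR as a guard against multicollinearity.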

Chapter 5 presents the study on the ShockOmics clinical trial database. The goal of this study was

to elucidate early metabolic signatures associated with the progression of septic shock and with

responsiveness to therapy, assessed as change in SOFA score measured at admission (T1, acute

phase) and 48 hours after (T2, post-resuscitation). We examined the plasma metabolome of 21

septic shock patients, classified into two groups according to the following criterion: patients for

whom both SOFA_T2 > 8 and Δ = SOFA_T1 - SOFA_T2 < 5 were classified as not responsive to therapy (NR,

7 patients), the remaining as responsive (R, 14 patients). Untargeted and targeted mass

spectrometry-based metabolomics strategies were combined to cover as much as possible of the

plasma metabolite repertoire. Firstly, a metabolic profiling, performed by direct flow

injection time-of-flight MS, was applied as an untargeted screening to explore the main perturbed

metabolic features. Afterwards, a targeted analysis to measure specific metabolic classes and the

magnitude of their variation was carried out with the same methods already outlined for the ALBIOS dataset.

The variation in metabolite concentrations from T1 to T2 was expressed as Δ = T2 - T1.

Two models were built: one for targeted metabolomics data and one which integrated data from

the targeted and the untargeted approach. Also in this case, the mRMR algorithm was used for

preliminary feature selection. The top 10, 20 and 30 ranked features were used to build two elastic

net regression models as previously described. LDA and PLS-DA were also implemented. The

multivariate models showed that lower variation in the concentration of plasmalogens and of fatty

acids, in combination with a larger increase in alanine, was associated with non-responsiveness.

These findings supported the emerging evidence that lipidome alteration plays an important role in

the individual patient response to infection; thus, understanding the regulatory pathways of lipids

is important for the development of an effective and tailored therapy. Furthermore, alanine

indicated a possible alteration in the glucose-alanine cycle, which occurs in the liver, thus providing a

different picture of liver function than bilirubin, which is usually used in clinical practice. These results

were also strengthened by the explorative analyses performed with Markov Networks, from which it

appeared that lipid species and gluconeogenic amino acids are important regulatory nodes.
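The responsiveness criterion and the definition of the metabolite variation used in this study can be written as two short functions. This is only a sketch: the function names and the example scores are hypothetical. Note the opposite sign conventions, since the SOFA change is computed as T1 minus T2 whereas the metabolite variation is T2 minus T1.

```python
def is_non_responder(sofa_t1, sofa_t2):
    """Return True if the patient is classified as not responsive (NR).

    NR requires both SOFA_T2 > 8 and delta = SOFA_T1 - SOFA_T2 < 5.
    """
    delta_sofa = sofa_t1 - sofa_t2
    return sofa_t2 > 8 and delta_sofa < 5


def metabolite_delta(conc_t1, conc_t2):
    """Variation of a metabolite concentration, expressed as T2 - T1."""
    return conc_t2 - conc_t1


# Hypothetical SOFA scores, for illustration only.
print(is_non_responder(12, 10))  # True: SOFA_T2 = 10 > 8 and delta = 2 < 5
print(is_non_responder(14, 9))   # False: delta = 5 is not below 5
print(is_non_responder(10, 6))   # False: SOFA_T2 = 6 is not above 8
```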

In conclusion, our results demonstrate the feasibility and the robustness of the proposed

approaches, in spite of the limited number of patients. In fact, the performance of our classification

models is good and the species identified are in line with other studies done on larger cohorts and

with investigations on the identified pathways. To the best of our knowledge, no other scientific

work has applied data mining techniques to perform multilevel omics analyses with the aim of finding

associations between plasma metabolome changes and mortality or responsiveness to therapy.

Thus, our findings represent a significant advance in the field and could be an important step toward

devising a personalized therapy.

Even if these results are very promising, some limitations must be discussed. An important

limit of the study is represented by the small size of the datasets used to build the predictive models.

However, we tried to reduce the confounding factors by focusing on a homogeneous group of

patients in severe septic shock, thus hypothesizing that the changes observed are mainly related to

shock progression and different prognoses. Another limitation is that metabolite concentrations

were measured at only two time points (the second being 7 days or 48 hours after shock diagnosis, depending on the study): a more

frequent monitoring of temporal changes in metabolites might provide better insights into the pathways

activated at different stages of disease. Finally, we are aware that these results can be affected by

overfitting since we did not have an independent validation dataset. However, we must recall that

we are not interested in prediction but in the development of an approach to describe the current

datasets and to identify the main pathways involved in pathology progression within the studied

patient cohorts. A thorough investigation of such pathways requires specific experiments that

take into account specific organs rather than the bloodstream, as well as enzymes or other byproducts of such

pathways. Despite these limitations, our results can be considered a further contribution in the

current clinical scenario.

Specifically, we observed that low levels of lysoPCs are associated with poor outcome.

Recently, systemic lysoPC treatment has been shown to be effective in rodent models of sepsis and

ischemia¹¹. These observations seem to suggest that elevation of plasma levels of these lipids can

actually help to relieve serious inflammatory conditions, thus a systematic administration of lysoPC

therapies may be of great utility in the treatment of septic shock. Moreover, the integration analyses

of metabolomics and proteomics data highlight the interplay between lipid metabolism,

inflammation and coagulation. Inflammation and coagulation are closely linked; it has therefore been

suggested that the use of drugs, such as aspirin, which act on both pathways may interfere with the

pathogenesis of sepsis. In this regard, a study performed on septic shock patients taking aspirin

showed a reduction in 30-day mortality rate compared to non-aspirin users, thus suggesting that

11 Yan, J. J., Jung, J. S. et al. Therapeutic effects of lysophosphatidylcholine in experimental sepsis. Nat. Med. 161–167 (2004).

low-dose aspirin administration (100 mg/day) could constitute a putative treatment for septic

shock¹².

In spite of these interesting findings, further investigations are needed to better elucidate

important pathophysiological mechanisms involved in septic shock progression and to identify novel

targets for the administration of a timely and effective therapy. Validation of the results in animal models

of septic shock (currently ongoing) will be performed in order to refine our assumptions and to better

describe the involved pathways.

A comparison with cardiogenic shock patients enrolled in ShockOmics could be a further step

in order to identify common pathways associated, for instance, with acute heart failure. This

information will be very useful to understand the molecular mechanisms which trigger MOF and

heart failure and which are thus independent of the root cause of shock. Eventually, by merging

the information gained from animal experiments and from the analyses on cardiogenic shock

patients, we aim to identify inflammatory mediators and molecular markers activated in shock and

to provide a list of putative biomarkers and pathways involved in its progression in order to guide a

timely early goal-directed therapy.

12 Falcone, M. et al. Septic shock from community-onset pneumonia: is there a role for aspirin plus macrolides

combination? Intensive Care Med. 42, 301–302 (2016).

1 INTRODUCTION

Septic shock remains one of the major problems in the Intensive Care Unit (ICU), with high

mortality and high-risk second-line therapies. Current treatments are mainly devoted to restoring

homeostasis and to preventing multiple organ failure (MOF), but clinicians are still far from having found

the optimal therapy. Since the transition to serious illness occurs in a few critical hours, the so-called

“golden hours”, it has been speculated that early recognition and cure administration could provide

maximal benefit in terms of outcome. For this reason, an attempt to use early goal directed therapy

(EGDT) has been made for the treatment of shock patients in the ICU. EGDT consists of an early

hemodynamic assessment on the basis of physical findings and vital signs to detect persistent global

tissue hypoxia, so as to rapidly adjust cardiac preload, afterload, and contractility to balance oxygen

delivery with oxygen demand. Although several studies have been performed to evaluate its

efficacy, EGDT is still the subject of considerable debate and a general consensus about its usefulness and

reproducibility has not been reached yet. Improvements in septic shock patients’ survival rate are

still modest, mainly due to the absence of predictive parameters for the monitoring of drug delivery

and patient response. To date, most studies have sought associations with mortality

or with comorbidities, but a clear picture of therapy effectiveness is lacking.

In recent years, the research community has become more and more aware that individual

response to therapy is crucial and that precision medicine could be important also in treating

acute conditions such as septic shock. Precision medicine is based on a multilevel approach

to tailor the therapeutics to individual patients, thus it extends personalized medicine beyond the

genome to include broader systems (e.g. the proteome and the metabolome). Specifically, the

interest in metabolomics has recently increased as the metabolites represent the end result of gene

and protein function and activity, therefore they may provide a more sensitive readout of drug

response phenotypes. Metabolomics datasets are very complex, thus data mining and machine

learning approaches represent valuable techniques for analysis and multilevel data integration. In

fact, these methods can help in elucidating early multilevel marker signatures which could reveal

the molecular pathways involved in septic shock.

This first chapter of the dissertation introduces the context of the PhD thesis. A brief

overview of shock, with particular emphasis on septic shock, will be given. The pathophysiology of

this syndrome, the mechanisms which lead to multiple organ failure and its clinical management

will also be briefly illustrated. Afterwards, an overview on omics data will be presented and the

issues still hampering multilevel data integration will be addressed. A description of the different

approaches currently in use to analyze metabolomics and proteomics data will be provided. The

leading idea of the present study is the need to shed light on the pathways involved in septic shock

to promote early intervention and a personalized treatment.

1.1 SHOCK

1.1.1 Definition, description and causes

Shock is a syndrome affecting one third of patients in ICU and can be defined as a state of

organ hypoperfusion which causes an imbalance between oxygen delivery and oxygen consumption

to tissues with resultant cellular dysfunction and death. Mortality is still very high, ranging from 20%

to 50%, and depends on different factors, such as the kind of shock, the source of infection and the

leading cause. According to the driver mechanism, shock can be classified as follows[1]:

• Hypovolemic shock: due to the loss of fluid from the circulation which leads to a critical decrease

in intravascular volume. Common causes are: excess fluid loss due to dehydration (e.g. after

severe vomiting or diarrhea) or to diseases which cause excess urination (e.g. diabetes and kidney

failure), extensive burns, blockage in the intestine, pancreatitis or severe bleeding (hemorrhagic

shock).

• Cardiogenic shock: it is a relative or absolute reduction in cardiac output due to a primary cardiac

disorder. This leads to impaired myocardial contractility and, as a consequence, to a dramatic

decrease in the ability of the heart to pump blood. The main causes include: heart attack,

myocarditis, disturbances of the electrical rhythm of the cardiac muscle, mass or fluid

accumulation and blood clots which interfere with the normal flow out of the heart.

• Obstructive shock: it is caused by mechanical factors (usually a physical obstruction) that

interfere with filling or emptying of the heart or of great vessels. Obstructive shock can be caused

by: pulmonary embolism, atrial tumor or clot and pericardial tamponade (i.e. accumulation of

fluid in the pericardial space resulting in a compression of the heart).

• Distributive shock: it results from dilation of blood vessels and pooling of blood in the peripheral

intravascular space. This typically occurs as a consequence of anaphylactic shock, adverse

neurogenic stimuli (e.g. spinal cord injury) or invasion of bacterial endotoxins which directly act

on blood vessels (e.g. sepsis). This latter case can lead to septic shock, a type of distributive shock

due to the progression of an infection.

In spite of the different root causes, all these kinds of shock have in common the collapse of

circulation and the resulting hypoperfusion which causes tissue anoxia and MOF. Anoxia potentiates

the loss of vascular tone, which ultimately leads to death as a result of cardiorespiratory failure.

1.1.2 Pathophysiology

Shock represents a series of events that, if uninterrupted, act synergistically to produce vicious

cycles that ultimately result in the patient’s death (Figure 1.1)[1],[2].

Figure 1.1 - Pathogenesis of shock. Top row: possible causes of shock resulting in heart failure. Left: shock symptoms. Hypoperfusion of vital organs is the crucial effect in shock. Its consequences at the cellular level are the shift to anaerobic metabolism and, if shock persists, cell anoxia and cell death. At the systemic level, shock triggers neurohormonal mechanisms which eventually lead to multiple organ failure. From Damjanov (2000)[1].

Depending on its severity, shock can be clinically classified into three stages: (1) non-

progressive or compensated, (2) progressive, and (3) irreversible. Early stages of shock are reversible

and treatable; however, once serious organ failure ensues, shock becomes irreversible.

At cellular level, reduced perfusion of vital tissues implies that cells receive an amount of O2

which is inadequate for aerobic metabolism. This condition is known as anoxia. As a consequence,

cells shift to anaerobic metabolism which is characterized by increased production of CO2 and

accumulation of lactic acid. Cellular function declines, and, if shock persists, irreversible cell damage

and death occur. At systemic level, cardiac failure and the resultant hypoperfusion are initially

compensated by peripheral vasoconstriction (compensated shock). At the beginning,

vasoconstriction is selective, shunting blood to the heart and brain and away from the splanchnic

circulation. With shock progression, vasoconstriction involves the renal blood vessels and results in

renal hypoperfusion and in a decreased glomerular filtration rate. Low urine output and even anuria

are thus typical of this stage. Anoxia and anuria lead to metabolic acidosis which has a depressive

effect on the heart and further potentiates pump failure. At this stage, the compensatory

mechanisms are not adequate to counterbalance the loss of blood volume; blood

pressure therefore declines to very low levels and heart function begins to deteriorate (progressive

shock). Heart insufficiency raises intrapulmonary venous pressure, causing stagnation of blood in

the pulmonary circulation which favors the formation of pulmonary edema and affects the

alveolocapillary functional units. Lungs cannot function properly, and this further contributes to

general hypoxia.

As blood pressure further decreases, blood begins to clot in the small vessels. At the same

time, toxins are released from intestine and other tissues that suffer from severe ischemia. In fact,

tissue anoxia results in the release of numerous inflammatory cytokines which cause vasodilation

and promote fluid loss by increasing the permeability of the peripheral blood vessels. In response

to these events, an intense tissue deterioration begins and, without adequate medical intervention,

progressive shock evolves into irreversible shock. This last phase of shock is characterized by

declining heart function and by the progressive dilation of peripheral blood vessels. These

events eventually lead to death, regardless of the amount or type of medical treatment applied. In

fact, when shock becomes irreversible, the organs begin to fail and the so-called multiple organ

failure (MOF) occurs. Under this condition, organs not directly injured by the original trauma

become dysfunctional due to systemic disorders involving immunoregulation and endothelial

dysfunction. The damage to tissues, including cardiac muscle, is so extensive that the patient is

destined to die, even if adequate blood volume is reestablished and the blood pressure is restored

to its normal value.

1.2 SEPTIC SHOCK

1.2.1 Definition and incidence

The definition and clinical criteria of septic shock have recently been revised by the Third

International Consensus Definitions for Sepsis and Septic Shock[3]. According to the new definition,

septic shock can be considered as a subset of sepsis, i.e. a syndrome characterized by a life-

threatening organ dysfunction caused by a dysregulated host response to infection. In septic shock,

the underlying circulatory, cellular and metabolic abnormalities are profound enough to

substantially increase mortality compared with sepsis alone[3]. According to the updated guidelines

provided by the International Consensus[4], septic shock can be diagnosed by: i) hypotension (i.e.

systolic blood pressure < 90 mmHg or mean arterial pressure < 65 mmHg) and vasopressor

requirement to maintain a mean arterial pressure of 65 mmHg or greater; ii) serum lactate level

greater than 2 mmol/L (> 18 mg/dL) in the absence of hypovolemia. The persistence of both these

conditions is associated with hospital mortality rates higher than 40%[3],[5]. Moreover, previous

clinical studies have shown that hyperlactatemia and the trend of plasma lactate levels over time can be

considered reliable markers of severity and mortality[6].
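The two diagnostic conditions listed above, and the equivalence between 2 mmol/L and 18 mg/dL of lactate, can be made explicit with a short sketch. The function names and the example values are hypothetical, the conversion assumes the molar mass of lactic acid (90.08 g/mol), and the code only restates the criteria as written in this section; it is an illustration, not a clinical tool.

```python
LACTIC_ACID_MG_PER_MMOL = 90.08  # molar mass of lactic acid in g/mol (= mg/mmol)


def lactate_mmol_per_l_to_mg_per_dl(lactate_mmol_per_l):
    """Convert a lactate concentration from mmol/L to mg/dL (1 L = 10 dL)."""
    return lactate_mmol_per_l * LACTIC_ACID_MG_PER_MMOL / 10.0


def meets_septic_shock_criteria(systolic_bp_mmhg, map_mmhg, vasopressors_required,
                                lactate_mmol_per_l, hypovolemia):
    """Check the two conditions described in the text.

    i)  hypotension (systolic < 90 mmHg or MAP < 65 mmHg) together with a
        vasopressor requirement to keep MAP at 65 mmHg or above;
    ii) serum lactate > 2 mmol/L in the absence of hypovolemia.
    """
    hypotension = systolic_bp_mmhg < 90 or map_mmhg < 65
    hemodynamic_criterion = hypotension and vasopressors_required
    lactate_criterion = lactate_mmol_per_l > 2.0 and not hypovolemia
    return hemodynamic_criterion and lactate_criterion


print(round(lactate_mmol_per_l_to_mg_per_dl(2.0), 1))  # 18.0 mg/dL
# Hypothetical patient values, for illustration only.
print(meets_septic_shock_criteria(85, 60, True, 3.1, False))  # True
print(meets_septic_shock_criteria(85, 60, True, 1.8, False))  # False: lactate not above 2 mmol/L
```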

The clinical manifestations of sepsis and septic shock are highly variable and subtle,

depending on several factors, such as age, sex, ethnicity, health condition of the patient, initial site

of infection, type of infection, and interval elapsed before treatment administration. Incidence rates

are known to increase with age, probably due to age-related differences in immune function[7].

As reported by the Healthcare Cost and Utilization Project databases, sepsis accounted for

more than $20 billion, corresponding to 5.2% of total United States hospital costs in 2011[8]. The

incidence of sepsis is continuously increasing and conservative estimates show that it is one of the

leading causes of mortality in critically ill patients worldwide[9]. In Europe, the incidence of severe sepsis is 90.4

cases per 100 000 adults per year, with an overall hospital mortality of 36%, as described in the last

Sepsis Occurrence in Acutely ill Patients study[10]. Patients who survive severe sepsis or septic shock

often have long-term physical and cognitive impairment and they are at risk for early death within

5 years, with mortality rates as high as 75%. This situation has a significant impact on health care

costs and also important social implications[11].

1.2.2 Pathophysiology

Although the pathophysiology of septic shock is not precisely understood, it has become

evident that it involves complex interactions between the pathogen and the host’s immune system.

More precisely, the initial host response to an infection triggers subsequent compensatory anti-inflammatory

mechanisms, which on the one hand contribute to the clearance of infection and

tissue repair, and on the other are implicated in organ injury and in susceptibility to secondary

infections[4].

When an infection occurs, pathogens are recognized by the host through pattern recognition

receptors (PRRs, e.g. toll-like receptors), which are proteins expressed by cells of the innate immune

system to identify two classes of molecules: pathogen-associated molecular patterns (PAMPs), and

damage-associated molecular patterns (DAMPs). PAMPs are highly conserved and unique structures

of microbial pathogens whereas DAMPs, or alarmins, are generated endogenously in presence of

cellular damage or injury. After interaction with PAMPs, PRRs trigger a complex signaling system.

This results in a series of concatenated events including the release of inflammatory mediators and

reactive oxygen species (ROS), local vasodilation, increased endothelial permeability, and activation

of coagulation pathways. In sepsis and septic shock, the physiological inflammatory response is

overwhelming and not adequately compensated by the anti-inflammatory mechanisms which

should limit its potentially harmful effect. More specifically, excessive production of reactive oxygen

species (ROS) damages cellular proteins, lipids and DNA and compromises mitochondrial function.

Moreover, the impairment of the coagulation cascade promotes the formation of

microvascular thrombi and fibrin deposition, thus causing microcirculatory alterations which

ultimately result in poor tissue perfusion[4],[12]. The interactions between inflammation,

coagulation and complement activation during sepsis progression are illustrated in Figure 1.2.

Inflammation and coagulation are tightly inter-connected in septic shock. In fact, uncontrolled

inflammation promotes disseminated intravascular coagulopathy (DIC), a syndrome characterized

by massive thrombin production and platelet activation, coupled with impaired fibrinolysis and

microvascular thrombosis. The combination of these events results in consumptive coagulopathy

and bleeding which contribute to organ failure. DIC is a central event in the pathophysiology of

sepsis and one of the most important markers of poor prognosis[13].

Figure 1.2 - Schematic representation of the interactions between inflammation, coagulation and complement activation during sepsis progression. From Lupu et al (2013)[13].

At the systemic level, all the events described here result in an inadequate oxygen delivery to

peripheral tissues and in consequent tissue anoxia which leads to the same pathology progression

already described for shock.

1.2.2.1 Effect of hypoxia at cellular level: the role of mitochondria

It has been observed that patients with sepsis have dysfunctional mitochondria which are

damaged by high levels of ROS[14]. Mitochondria are membrane-bound organelles found in most

eukaryotic cells. They are usually described as the "cellular power plant" because within them most

of the cell supply of adenosine triphosphate (ATP) is generated. In addition to supplying cellular

energy, mitochondria are also involved in other essential tasks such as signaling, cellular

differentiation, cell death, control of the cell cycle and cell growth. The number of mitochondria in

a cell varies widely by organism and tissue type, according to their energy demand[15],[16].

Mitochondria have a double membrane structure: the outer membrane encapsulates the

organelle whereas the inner one surrounds the central matrix space. These two phospholipid

membranes separate four distinct compartments (Figure 1.3)[16],[17]:

1. outer membrane: it provides a permeability barrier and contains several integral proteins,

called porins, which regulate the exchange of substances;

2. intermembrane space: it has an ionic composition similar to the cytosol but it contains also a

distinct group of carrier proteins specific of the mitochondrion;

3. inner mitochondrial membrane: it is highly folded into cristae which greatly increase its surface

area. It is a highly specialized membrane since its lipid bilayer contains cardiolipin, a four-tailed

phospholipid which makes the membrane especially selective for ions. This membrane also

houses the electron transport chain (ETC) complexes and respirasomes, thus giving structural

support to the phosphorylation apparatus;

4. matrix: it contains hundreds of metabolic enzymes, ribosomes, mitochondrial DNA and RNA. It is

here that ATP is produced through oxidation of pyruvate and fatty acids, which then enter the

Krebs Cycle (see Figure 1.4).

Figure 1.3 – Mitochondrion structure. It has a double membrane: the inner one contains the ETC apparatus and has deep grooves (cristae) which increase its surface area. The ATP synthesis occurs inside the mitochondrial matrix where also mitochondrial DNA and RNA are contained.

The particular structure of the mitochondrion provides a compartmentalization of its

metabolism: the membranes resemble a sieve which regulates substrate and waste product

exchange whereas all the reactions necessary for energy production occur in the matrix. The main

substrates for mitochondrial oxidation are pyruvate and fatty acids (FA). Pyruvate comes from

glucose or other sugars originating from carbohydrate metabolism, while FAs come from fats. Both

these fuel molecules are transported across the inner mitochondrial membrane and then converted

to the crucial metabolic intermediate acetyl-CoA by enzymes located in the mitochondrial matrix

(Figure 1.4)[16],[17].

Figure 1.4 - Representation of classic pathways of cellular metabolism. Substrates (glucose and FAs) are transported across the cell membrane into the cytosol where they are activated to pyruvate and acetyl-CoA. These two metabolic intermediates are transported inside the mitochondrion by specific transport systems. Once inside, the substrates enter the Krebs Cycle and their reducing equivalents are used by the electron transport chain to generate a proton gradient which is used for ATP production. From Doenst et al[18].

The main mechanisms involved in cellular metabolism are listed below:

• Glucose use: the glucose used for ATP generation either comes from uptake of exogenous

molecules or, to a lesser extent (<40%), from glycogen stores. Glucose is transported into the

cytosol by glucose transporters, including GLUT1 and GLUT4. In the cell, glucose is

phosphorylated to glucose-6-phosphate, which enters the glycolytic pathway. Glycolysis

generates pyruvate which can be transported into the mitochondrial matrix where it is oxidized

to acetyl-CoA by the multienzyme complex pyruvate dehydrogenase[18]. This complex is a key

regulator of pyruvate oxidation since it is inhibited by the accumulation of end products of FA

oxidation[17].

• FA use: the oxidation of FAs is a complex process which occurs within the mitochondria and

represents the major source of energy for cells. The process by which FAs are broken down into

energy involves different kinds of proteins, listed in Table 1.1, and it can be divided into several

steps, as schematized in Figure 1.5[18],[19]:

1. uptake of FAs into the cytosol, facilitated by transport proteins and plasma membrane FA-

binding proteins (FABP, FAT)[20];

2. addition of a CoA group by fatty-acyl-CoA synthase (FACs) and formation of long chain acyl-

CoA, a temporary compound which can enter the mitochondria;

3. conversion of long-chain acyl-CoA into long-chain acylcarnitine by carnitine

palmitoyltransferase I (CPT I). This reaction represents a crucial regulatory node in FA

oxidation. In fact, this enzyme is subject to feedback inhibition by the acyl-CoA breakdown

product malonyl-CoA that builds up during high rates of FA oxidation. Accumulation of

malonyl-CoA reduces FA oxidation and further increases cytoplasmic free FA and acyl-CoA

metabolites leading to energy deficiency[21],[22].

4. transportation into the mitochondrial matrix via carnitine translocase (CAT). Inside the

matrix long-chain acylcarnitine is converted back to long-chain acyl-CoA by CPT II;

5. inside the mitochondrial matrix, long chain acyl-CoA molecules are broken down to acetyl-

CoA, which is then oxidized in the Krebs Cycle.

Figure 1.5 - Schematic representation of FAO. FAs primarily enter a cell via fatty acid protein transporters on the cell surface (FABP, FAT). Once inside, FACS adds a CoA group to the fatty acid which is then converted to acylcarnitine by CPT I. Acylcarnitine is transported by CAT across the inner mitochondrial membrane. Once in the matrix CPT II converts the acylcarnitine back to acyl-CoA which enters the fatty acid β-oxidation pathway, resulting in the production of one acetyl-CoA from each cycle of β-oxidation. Acetyl-CoA then enters the Krebs cycle. The NADH and FADH2 produced by both β-oxidation and the Krebs cycle are used by the electron transport chain to produce ATP[21].

• ATP production: the common end product of glucose and FA oxidation is acetyl-CoA, which

enters the Krebs Cycle. The cycle converts the carbon atoms from acetyl-CoA to CO2, which is

released from the cell as a waste product. The cycle also generates high-energy electrons,

carried by the activated carrier molecules NADH and FADH2. These high-energy electrons are

then transferred to the inner mitochondrial membrane, where they enter the ETC for oxidative

phosphorylation. As electrons move along this chain, energy is stored as an electrochemical

proton gradient across the inner membrane which is used to drive ATP synthesis[16]. Finally,

ATP is transported from the mitochondrial matrix to the cytoplasm through the adenine

nucleotide transporter, making energy available for cellular work.

PROTEIN | NAME | FUNCTION
Fatty acid binding protein | FABP | Peripheral membrane protein: traps FAs to facilitate their absorption[20].
FA translocase | FAT | Integral membrane protein: enables FAs to enter the cell[20].
Fatty acyl-CoA synthase | FACs | Cytosolic enzyme responsible for esterification of FAs to long-chain fatty acyl-CoA[21].
Carnitine palmitoyltransferase I | CPT I | Located in the outer mitochondrial membrane, this enzyme converts acyl-CoA into acylcarnitine so that it can enter the mitochondrion.
Carnitine translocase | CAT | Shuttles the acylcarnitine across the inner mitochondrial membrane[19].
Carnitine palmitoyltransferase II | CPT II | Located in the inner mitochondrial membrane, this enzyme converts acylcarnitine back to acyl-CoA[23].

Table 1.1 – List of proteins involved in FA oxidation and main functions.

ROS production is also an important aspect of the mitochondrial life cycle. Under normal

circumstances, about 98% of mitochondrial oxygen consumption is linked to ATP formation through

oxidative phosphorylation. A by-product of this process is the generation of ROS, a variety of

molecules and free radicals derived from molecular oxygen[24]. ROS are important in the redox

signaling from the mitochondrion to the rest of the cell but they also contribute to mitochondrial

damage in several pathologies. In fact, ROS production can lead to oxidative damage to

mitochondrial proteins and membranes, thus impairing the ability of these organelles to synthesize

ATP and to carry out their wide range of metabolic functions which are central to the normal

operation of cells. Mitochondrial DNA is also susceptible to attack by ROS. Since expression of the

entire mitochondrial genome is required to maintain the functional integrity of mitochondria,

mtDNA damage and depletion result in an impaired mitochondrial respiratory capacity and in cell

growth arrest[25]. Mitochondrial oxidative damage can also increase the tendency of mitochondria

to release intermembrane space proteins and thereby activate the cell apoptotic machinery[26].

Recent investigations indicate that mitochondrial damage contributes significantly to the

pathogenesis of sepsis-induced MOF through dysregulation of oxygen metabolism, i.e. cytopathic

hypoxia, accelerated oxidant production and promotion of cell death[25]. Cytopathic hypoxia results in

impaired oxygen utilization and development of tissue acidosis, as demonstrated by excess lactate

levels in the blood of septic shock patients. Cellular oxygen metabolism can be impaired by a number

of different mechanisms but there is evidence that it is primarily altered at the cellular level, particularly

in the mitochondria. In fact, during sepsis several critical components of the ETC are compromised,

so that mitochondria are no longer able to efficiently provide energy to the cell. In addition to tissue

hypoxia, some mediators of the innate immune system, such as tumor necrosis factor α (TNFα)

and various interleukins, also have an important role in the etiology of cytopathic changes and

mitochondrial injury. In particular, TNFα acts via receptor-mediated signaling pathways and triggers

mechanisms which induce cytotoxicity. Mitochondrial death pathways are also involved in the

depletion of lymphocytes and intestinal epithelial cells thus compromising the host’s immune

response and physical barriers to infection.

1.2.3 Treatment of septic shock

The current treatment for septic shock patients is mainly aimed at restoring homeostasis and

preventing MOF. The Surviving Sepsis Campaign[27] recently provided updated guidelines for

the management of severe sepsis and septic shock, which we briefly summarize here. Within 6 hours

after shock diagnosis, initial aggressive resuscitation with vasopressor administration is provided to

patients with hypotension. In fact, septic shock patients require the administration of fluids, usually

crystalloids, and vasoactive agents, such as noradrenaline, in order to avoid prolonged hypotension.

They often receive lung support with a ventilator in order to achieve adequate oxygenation. In this

initial phase bacterial cultures are also collected to make a possible diagnosis and to start an

empirical antimicrobial therapy. Early antimicrobial therapy is of primary importance in the

treatment of septic shock, since the prompt administration of an appropriate therapy significantly

reduces mortality risk[28]. The choice of the therapy depends on several factors, such as the

suspected source of infection and the medical history of each patient. Overall, the aim of the

treatment is both to restore hemodynamic stability and to mitigate the effect of uncontrolled

infection. This initial stage is followed by the so-called supportive therapy: patients are continuously

monitored and a set of different procedures, aimed at supporting organ function, can be performed,

such as blood product administration, mechanical ventilation and renal replacement therapy. When

possible, de-escalation of therapy is done to avoid the emergence of resistant organisms and to

minimize the risk of drug toxicity[4],[27]. In spite of the progress made in the understanding of

the underlying biologic features of sepsis, clinicians are still far from having found the optimal clinical

treatment, thus new strategies should be adopted to translate advances in molecular biology into

effective new therapies.

1.2.4 The SOFA score

To assess efficacy and cost-effectiveness of new therapies, mortality alone is not a sufficient

parameter. In clinical practice, the severity and progression of organ dysfunction are usually assessed

by means of scores. Several scoring systems have been introduced to quantify abnormalities

according to clinical findings, laboratory data and therapeutic treatments. The predominant one,

currently in use, is the Sequential Organ Failure Assessment (SOFA) score[3]. This score is based on

the assumption that organ failure is not an “all-or-none” phenomenon but rather a dynamic process

for which also the progression of events is important. A regular assessment of organ function is

therefore necessary to follow the evolution of the disease. The SOFA score estimates dysfunctions

regarding liver, kidney, coagulation, cardiovascular, respiratory and central nervous systems; it also

accounts for clinical interventions and laboratory variables like PaO2, platelet count, creatinine and

bilirubin levels. It is composed of six scores, one for each organ system (respiratory, cardiovascular,

hepatic, coagulation, renal and neurological), graded from 0 to 4 according to the degree of

dysfunction, 4 being the worst condition[29]. A higher SOFA score is thus associated with an

increased probability of mortality. The variation, the mean and the highest SOFA scores are all

predictors of outcome. Specifically, an increase of 2 points or more is associated with an in-hospital

mortality greater than 10%[3] and a maximum SOFA greater than 15 points is associated with a

mortality rate above 90%[29].
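The structure of the score described above can be illustrated with a minimal Python sketch. It assumes that the total SOFA is the sum of the six organ subscores (range 0 to 24) and that the variation is taken as the difference between the last and the first assessment; the function names `total_sofa` and `summarize_sofa` are hypothetical and are introduced here only for illustration. The clinical thresholds that map measured variables onto each 0-4 subscore are not reproduced.

```python
from typing import Dict, List

# The six organ systems graded by the SOFA score, as listed in the text.
ORGAN_SYSTEMS = ("respiratory", "cardiovascular", "hepatic",
                 "coagulation", "renal", "neurological")


def total_sofa(subscores: Dict[str, int]) -> int:
    """Sum the six organ subscores (each graded 0-4) into a total of 0-24."""
    if set(subscores) != set(ORGAN_SYSTEMS):
        raise ValueError("expected exactly one subscore per organ system")
    for organ, value in subscores.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{organ} subscore must be between 0 and 4")
    return sum(subscores.values())


def summarize_sofa(serial_totals: List[int]) -> Dict[str, float]:
    """Variation (last minus first, an assumption), mean and maximum of serial totals."""
    if not serial_totals:
        raise ValueError("at least one SOFA assessment is required")
    return {
        "delta": serial_totals[-1] - serial_totals[0],
        "mean": sum(serial_totals) / len(serial_totals),
        "max": max(serial_totals),
    }
```

Keeping the per-organ subscores separate until the final sum mirrors the idea that organ failure is tracked system by system over time rather than as a single all-or-none event.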

1.3 OMICS DATA

Omics data (e.g. genomics, transcriptomics, proteomics and metabolomics) provide system-level

information for all types of cell components and interactions in an organism. Each type of omics

data describes a different step of the biological information flow, from DNA up to the

expression of a particular cell phenotype (Figure 1.6).

Figure 1.6 - Flow of biological information represented also as omics data

Briefly, in every organism DNA (genomics) is first transcribed to mRNA (transcriptomics) and

translated into proteins (proteomics). Proteins, such as enzymes or transcription factors, catalyze

reactions of which metabolites (metabolomics), glycoproteins and oligosaccharides (glycomics) and

various lipids (lipidomics) are byproducts. All processes involved are generally dictated by different

kinds of molecular interactions: protein-DNA interactions in the case of transcription, protein-

protein interactions and enzymatic reactions in translational processes. Ultimately, the metabolic

pathways comprise integrated networks, or flux maps (fluxomics), which dictate the cellular

behavior or phenotype (phenomics)[30].

Although the knowledge of each type of omics data is crucial for global understanding of

cellular processes, this information alone is not sufficient to gain a comprehensive view of all the

biological mechanisms involved in the phenomenon under study. Integrated approaches combining

two or more omics fields (e.g. metabolomics and proteomics) are thus required to gain deeper

insights. Although this information is usually available, multilevel integration of different omics data

is still an open issue and effective data integration is far from being achieved. In fact, the handling,

processing, analysis and integration of omics data require specialized mathematical, statistical and

bioinformatics tools which are not fully available yet. Several technical problems are still hampering

a rapid progress in the field, and researchers usually have to compare multiple databases and to

manually extract and assemble the information needed[31]. Hence, it is easy to infer that a

meaningful comparison and exchange of omics data, obtained from different platforms or different

laboratories, are cumbersome. This is mainly due to the lack of standards for data formats, data

processing parameters, and data quality assessment. In addition to this, it is extremely challenging

to figure out how to actually analyze the huge amounts of data generated and how to deduce

biological insights. The necessity of an integrated pipeline for comprehensive analysis of complex

omics data sets has therefore become a critical aspect of multilevel data integration studies[32],

[33]. A more detailed review on omics data analysis and on the currently available software has

been published in Briefings in Bioinformatics¹.

1.3.1 Metabolomics

Metabolomics is a rapidly growing field of biological sciences which has lately reached

widespread applications in many different areas, including molecular epidemiology, biomarker

discovery and identification, drug development and personalized health care[34]. It consists of the

analysis and quantification of thousands of metabolites, i.e. small molecular compounds (< 1500

Dalton) which constitute the end products of cellular metabolism and can thus be considered the

chemical fingerprint of an organism at a precise time point. Overall, the aim of a metabolomic study

1 A. CAMBIAGHI, M. Ferrario, M. Masseroli, “Analysis of metabolomic data: tools, current strategies and future challenges for omics data integration”, Briefings in Bioinformatics (2016)

is to correlate the changes in metabolite concentration with pathological states, or with the effect

of environmental influencing factors, such as drugs or contaminants[35].

There are two analytical approaches to metabolomic analysis: targeted and

untargeted. Targeted metabolomics is the measurement of a small set of known compounds quantified

according to a standard. The choice of the set of metabolites is driven by a specific biochemical

question so it includes one or more already defined pathways. The main limitation of the targeted

approach is that an a priori knowledge of the compounds of interest is needed, therefore this

method is less suitable for the discovery and identification of novel metabolic markers[36], [37]. The

untargeted approach is more global in scope since it does not depend on an a priori hypothesis. The

aim of this technique is to simultaneously measure as many metabolites as possible without any

bias. Variations in metabolite concentrations are observed as total changes of chromatographic

patterns without requiring the previous knowledge of the compounds under investigation. Each

metabolite produces one or more chromatographic peaks, which correspond to an ion with unique

mass-to-charge ratio and retention time. Metabolite quantities are not measured in absolute terms; it is possible to obtain

only their relative quantification, usually expressed as fold change[37], [38]. Metabolite

identification from chromatographic peaks is a manual and time-consuming process and, due to the

lack of annotation databases of high coverage, the association of metabolites with their spectra is

still a challenge[37].
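As an illustration of the relative quantification mentioned above, the following minimal Python sketch computes a log2 fold change between the mean peak intensities of two sample groups. The function name `log2_fold_change`, the use of the group mean and the intensity values are assumptions made for illustration only; they are not data or code from this study.

```python
import math
from statistics import mean
from typing import Sequence


def log2_fold_change(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Return log2(mean(group_a) / mean(group_b)) for one metabolite feature."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    if mean_a <= 0 or mean_b <= 0:
        raise ValueError("mean intensities must be positive")
    return math.log2(mean_a / mean_b)


# Hypothetical peak areas (arbitrary units) of one feature in two groups.
group_a = [2400.0, 2600.0, 2200.0]
group_b = [1150.0, 1250.0, 1200.0]
lfc = log2_fold_change(group_a, group_b)
```

The logarithmic scale is a common choice because it treats increases and decreases symmetrically: a doubling gives +1 and a halving gives -1.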

A targeted metabolomic approach with absolute quantification may appear limited and risky when

compared to untargeted strategies, which emphasize the global behavior of the metabolome. In

fact, one may argue that a targeted approach might miss novel metabolic features linked to

metabolic derangement in the condition under study. However, untargeted approaches have

several drawbacks such as difficulties in identifying all the detected signals, the reliance on the

intrinsic analytical coverage of the MS platform employed, the bias towards the detection of highly

abundant metabolites and the lack of absolute quantitative information on metabolites. Targeted

metabolomics, by contrast, provides quantitative information about the molar concentrations of the

metabolites involved in a pathway, facilitating the immediate understanding of any alterations

between different biological states.

1.3.1.1 Techniques for metabolomics data analyses

There are several techniques which can be used for metabolite profiling, each with its own

advantages and drawbacks. Therefore, a combination of different approaches is usually

applied to gain a broader perspective. The two main analytical methods are mass spectrometry

(MS) and nuclear magnetic resonance (NMR) spectroscopy[38].

MS is the most widely applied technology in metabolomics, due to its sensitivity and to the

wide range of covered metabolites. Mass spectrometers operate by ion formation, separation of

ions according to their mass-to-charge (m/z) ratio and detection of separated ions. MS can be used

to analyze biological samples either directly via direct-injection MS or coupled with

chromatographic or electrophoretic separation[39]. However, given the heterogeneity of the

landscape of the metabolome, this latter strategy is preferred to decrease sample complexity. The

most commonly used separation techniques are gas chromatography (GC), high-performance

liquid chromatography (HPLC) and capillary electrophoresis (CE)[40]. After separation,

data are usually collected on a quadrupole time-of-flight (Q-TOF) mass spectrometer or by ion trap

instruments such as an Orbitrap[37]. The metabolomics analyses presented in this dissertation have

been performed by Q-TOF, as described in Appendix A. Q-TOF mass spectrometry is based on the

assessment of an ion's m/z by a time measurement. Briefly, ions are accelerated by an electric field

of known strength, thus each ion has the same kinetic energy as any other ion with the same charge.

The velocity of the ion depends on the m/z: for equal charge, heavier ions are slower, whereas,

for equal mass, ions with higher charge are faster. The time each ion takes to reach the detector

depends on its velocity, and therefore it is a measure of its m/z. From this ratio and from known

experimental parameters, it is possible to identify the ion.
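The time-of-flight principle described above can be sketched numerically. The following Python example assumes an idealized linear instrument in which an ion of mass m and charge number z, accelerated through a potential U, acquires the kinetic energy z·e·U = ½·m·v² and then drifts over a field-free length d, so that t = d·sqrt(m / (2·z·e·U)). Reflectrons, initial energy spread and the quadrupole stage are ignored; the function names, the voltage and the drift length are hypothetical values chosen for illustration.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulomb
DALTON = 1.66053906660e-27           # kilogram


def flight_time(mass_da: float, charge: int, voltage: float, drift_length: float) -> float:
    """Flight time (s) of an ion in an idealized linear TOF analyzer."""
    mass_kg = mass_da * DALTON
    # Kinetic energy gained in the accelerating field: z * e * U = 1/2 * m * v^2
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return drift_length / velocity


def mz_from_flight_time(time_s: float, voltage: float, drift_length: float) -> float:
    """Invert the relation above: m/z in Da per elementary charge."""
    return 2 * ELEMENTARY_CHARGE * voltage * time_s ** 2 / (drift_length ** 2 * DALTON)


# Hypothetical instrument settings: 20 kV acceleration, 1 m drift tube.
t_light = flight_time(500.0, 1, 20_000.0, 1.0)
t_heavy = flight_time(1000.0, 1, 20_000.0, 1.0)
t_heavy_doubly_charged = flight_time(1000.0, 2, 20_000.0, 1.0)
```

Under these assumptions an ion of 1000 Da carrying two charges has the same m/z as a singly charged ion of 500 Da and therefore the same flight time, which is why the measurement yields m/z rather than mass alone.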

NMR is also used for metabolite detection since it is a rapid and non-destructive method

which requires only minimal sample preparation. NMR spectroscopy functions by the application of

strong magnetic fields and radio frequency pulses to the nuclei of atoms, thus it exploits the

magnetic properties of certain atomic nuclei to determine the physical and chemical properties of

molecules. The output of an NMR analysis is a spectrum which is often convoluted and hard to

interpret. This aspect, together with the low sensitivity of the technique, makes NMR unsuitable for the

analysis of large numbers of low-abundance metabolites[38],[39].

1.3.1.2 Metabolomics analysis workflow

In the following, we provide a brief overview of how a metabolomic analysis is

performed[37],[38],[41]. A typical metabolomic study consists of several steps, as illustrated in

Figure 1.7, which can be grouped as follows[31],[38],[42],[43]:

Figure 1.7 - Flow chart of a typical metabolomics study. After sample preparation, specific metabolic signals are acquired using heterogeneous analytical platforms (DATA ACQUISITION). Raw signals are then pre-processed to produce data in a suitable format for univariate and multivariate statistical analysis (DATA PROCESSING). Significantly expressed metabolites are then linked to the biological context, through enrichment and pathway analysis, and mapped into networks. Finally, metabolomics data are integrated with other ‘omics’ data and with prior knowledge to gain a comprehensive view of the molecular processes involved (DATA INTERPRETATION AND INTEGRATION).

1. SAMPLE PREPARATION: according to the kind of sample under analysis (i.e. blood plasma,

serum, urine, saliva, solid tissues or cultured cells), a different approach must be adopted. In

fact, a correct sample preparation is essential to ensure the optimal extraction of metabolites

and thus reduce experimental error[40],[44].

2. DATA ACQUISITION: different approaches are used to separate and chemically characterize

diverse groups of metabolites on the basis of both their chemical and physical properties. As

previously outlined, compound separation techniques such as GC, HPLC and CE are combined with

compound detection techniques, such as MS or NMR. Each method has different resolution,

sensitivity and technological limitations in identifying metabolites; thus, it should be chosen in

accordance with the chemical and physical characteristics of each sample and the kind of

analysis performed (targeted or untargeted)[40],[41],[43].

3. DATA PROCESSING: to facilitate compound quantification, the acquired raw signals

(chromatograms, spectra or NMR data) are pre-processed by ad hoc software tools such as the

commercial software SIEVE™ by Thermo Scientific (www.thermofisher.com) or the cloud-based

platform XCMS (Center for Metabolomics at the Scripps Research Institute[45]). Some groups

also use in-house scripts developed in other software environments (e.g. MATLAB). The pre-processing

stage usually involves noise reduction, retention time correction, peak detection and

integration, and chromatogram alignment. In untargeted studies, metabolites have to be

identified from spectral information, usually by searching different databases, such as

the Human Metabolome Database (HMDB, http://www.hmdb.ca/[46]) or the MEtabolite and

Tandem MS Database (METLIN, http://metlin.scripps.edu[47]). Once the metabolites list is

ready, a statistical analysis is performed to find significant differences between sample sets. A

typical statistical analysis for metabolomic data consists of two phases: a more general analysis

using traditional statistical methods followed by a more focused investigation applying data

mining strategies. Overall, traditional statistical methods are used to gain a global view of the

considered datasets and to identify which metabolites significantly change under the studied

conditions. A limit of traditional approaches is that they highlight relationships among variables

based only on mathematical criteria (e.g., maximization of variance, or correlation) thus they

do not take into account correlations of biological origin[48]. These analyses should be combined

with data mining techniques, which allow better discrimination of groups of functionally related

metabolites (i.e., metabolite sets) which can be used for biological interpretation[36], [42].

Chapter 2 will present the main techniques of data mining, emphasizing the ones adopted in

this study.

4. DATA INTERPRETATION AND INTEGRATION: in this final step, the set of significant metabolites

previously found is linked to the biological context under study. In fact, to better understand

the biological role of each metabolite, the chemical information derived from metabolomics

analyses has to be related to both their biochemical origins and physiological

consequences[35],[43]. This can be done through enrichment and pathway analyses.

Enrichment analysis aims to investigate the enrichment (i.e., over and/or under-expression) of

predefined groups of functionally related metabolites in order to find significant expression

changes among them. Moreover, the identification of altered metabolites allows the selection of

specific biological pathways, or disease conditions, which can be further investigated[31].

Pathway analysis involves the description and visualization of the interactions among genes,

proteins, or metabolites within cells, tissues or organs. Its goal is to identify the pathways which

significantly impact on a given biological process[49]. Enrichment and pathway analyses are

usually performed using specific software tools (e.g. MetPA[49]), which map significant

metabolites to known biochemical pathways on the basis of the information contained in public

databases (e.g. the Kyoto Encyclopedia of Genes and Genomes (KEGG))[50]. Pathway data are

usually presented as networks, with metabolites as nodes and reactions as edges. To obtain a

comprehensive view of all the biological processes involved, the information regarding the

metabolic pathways has to be integrated with transcriptomics and proteomics data[35],[51].

Integration with biological knowledge derived from the literature or from previous

experimental data is also suggested to reach a more reliable evaluation of the process under

study[43],[48].
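The enrichment step of the workflow above can be illustrated with a minimal sketch. One common formulation, assumed here and not necessarily the one implemented by the tools cited in this section, is an over-representation test: given a background of measured metabolites, a predefined metabolite set and a list of significant metabolites, the hypergeometric distribution gives the probability of observing at least the recorded overlap by chance. The function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background: int, n_in_set: int,
                       n_significant: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background  : metabolites measured in the study
    n_in_set      : background metabolites belonging to the metabolite set
    n_significant : metabolites found to change significantly
    n_overlap     : significant metabolites that belong to the set
    """
    if not 0 <= n_in_set <= n_background or not 0 <= n_significant <= n_background:
        raise ValueError("set sizes must lie between 0 and the background size")
    largest_overlap = min(n_in_set, n_significant)
    if not 0 <= n_overlap <= largest_overlap:
        raise ValueError("overlap is not compatible with the set sizes")
    # Number of ways to draw the significant list from the background.
    all_draws = comb(n_background, n_significant)
    # Draws containing k set members, summed over k >= observed overlap.
    tail_draws = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_significant - k)
        for k in range(n_overlap, largest_overlap + 1)
    )
    return tail_draws / all_draws
```

For example, with 10 measured metabolites, a set of 5 and 5 significant metabolites that all fall in the set, the call `enrichment_p_value(10, 5, 5, 5)` gives 1/252. When many metabolite sets are tested, the resulting p-values would normally also be corrected for multiple testing.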

1.3.2 Proteomics

Proteomics refers to the large-scale analysis of the proteins encoded by the genome. It

involves the application of different technologies to detect and quantify the overall protein content

of a cell, tissue or organism in order to understand protein structure and function. Like

metabolomics, proteomics is used in many research fields such as biomarker discovery, vaccine

production and study of the alteration of expression patterns in response to different stimuli or

disease states. Proteomics analyses are very complex because they consist of the identification of

the protein signatures of the whole genome, which differ from cell to cell and from time to time.

For instance, the human genome harbors about 26000 to 31000 protein-encoding genes,

whereas the total number of human protein products, including splice variants and essential post-translational

modifications, is estimated to be close to one million. In spite of the advance of new

technologies, comprehensive proteomics analysis of biological samples (e.g. plasma, serum or other

bodily fluids or tissues) has not been fully developed yet, mainly due to the high complexity of the

samples and to the wide dynamic range of protein concentrations. Therefore, processing and

analysis of proteomics data is a very long and complicated process[52],[53].

There are two main approaches to proteomic analyses: top-down and bottom-up. In top-

down proteomics, intact proteins or large fragments are ionized and analyzed by mass

spectrometry, whereas bottom-up approaches rely on peptides generated by proteolytic digestion of

protein samples. Since top-down proteomics is limited by protein size (< 50 kDa), bottom-up

techniques are currently the most commonly used[54]. Like metabolomics, proteomics analyses can

be further divided into targeted and untargeted. Targeted proteomics experiments are hypothesis-driven,

thus they are designed to quantify a limited number of proteins (i.e. fewer than one hundred)

with very high precision. Untargeted proteomics studies instead aim at identifying as many proteins

as possible across a broad dynamic range.

Irrespective of the approach adopted, several different techniques can be applied to

perform a proteomics analysis[53]. Generally, there are two main strategies: gel-based and shotgun

proteomics, both of which include a great variety of analytical methods. Gel-based applications

consist of one-dimensional and two-dimensional polyacrylamide gel electrophoresis and they were

developed well before the term proteomics was coined. In spite of this, they are still

extensively used mostly for qualitative experiments, protein separation and quantitative expression

profiling[55]. The main drawbacks of these techniques are the poor reproducibility and the inability to

detect certain classes of proteins, including acidic, basic, low-abundance and hydrophobic ones.

Shotgun proteomics, also called gel-free or MS-based techniques, have become the most common

method for proteomics analyses since they are more sensitive and reproducible than gel-based ones.

They include several methods among which multidimensional protein identification technology,

isotope-coded affinity tag, stable isotope labeling with amino acids in cell culture, and isobaric

tagging for relative and absolute quantitation (iTRAQ)[56]. This latter approach will be further

described in the following since data from iTRAQ experiments were analyzed in this study to

complement the metabolomics analyses. Additional details on the experimental procedure adopted in

this study can be found in Appendix B.

1.3.2.1 The isobaric tags for relative and absolute quantification (iTRAQ) method

The iTRAQ technique can be used for both relative and absolute quantitation and enables the

simultaneous analysis of 4 to 8 different samples. It is based on the use of stable-isotope-labeled

molecules (i.e. isobaric reagents) which covalently bind to the primary amines (N-terminus and lysine side chains) of

peptides and proteins. The iTRAQ analytical process is long and consists of different steps. Firstly,

proteins are isolated from the biological samples (protein extraction) to obtain a protein isolate.

Several strategies can be adopted for protein extraction according to the kind of sample being

analyzed and to the purpose of the study. For instance, extraction is more selective for a targeted

study than for an untargeted one. After extraction, highly abundant proteins have to be removed to

avoid any bias, usually by chromatographic separation techniques. Albumin is very abundant in

human plasma, thus it is usually removed prior to performing a proteomics analysis in order to

better detect and quantify lower abundance proteins. Before iTRAQ labelling, proteins are digested

using an enzyme, usually trypsin, to generate smaller proteolytic peptides. This is done since most

proteins are too big to fall within the limited mass range which a typical mass spectrometer can

measure. Each sample is labeled with a different iTRAQ reagent and then combined into one sample

mixture (Figure 1.8).

Figure 1.8 - Example of iTRAQ proteomic quantitation used on 6 different samples. After trypsin digestion, samples are labeled with individual mass tags and then combined in a single mixture for LC-MS/MS analysis. Since the masses of all of the tags are the same, identical peptides from different samples elute together in the LC column. After the analysis by tandem MS, the removed tags enable the quantitation of relative peptide intensities, while the peptide fragment ions are sequenced for protein identification.

The combined sample mixture is then analyzed by liquid chromatography and tandem mass

spectrometry (LC-MS/MS) for both identification and quantification. Liquid chromatography

makes it possible to divide the peptide mixture into smaller sub-samples in order to simplify subsequent

analysis and results interpretation; eluted compounds are then injected into the mass spectrometer.

In the first round of MS, peptides are ionized and their mass-to-charge ratio measured to yield a

precursor ion spectrum. During the second MS, the isobaric tags are broken off and quantification

is performed based on the relative abundance of these tags. The relative quantity of a peptide

among the treated samples is determined by comparing the intensities of reporter ion signals also

present in the MS/MS scan[56],[57].
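The relative quantification step described above can be sketched as follows. The Python example assumes a 4-plex experiment whose reporter ions are labeled by their nominal masses (114 to 117) and expresses each channel as a ratio to a chosen reference channel; the function name `reporter_ratios`, the choice of reference and the intensity values are hypothetical. Corrections applied in practice, such as isotope impurity correction and normalization across channels, are omitted.

```python
from typing import Dict


def reporter_ratios(intensities: Dict[str, float], reference: str) -> Dict[str, float]:
    """Ratio of each reporter-ion intensity to that of the reference channel."""
    if reference not in intensities:
        raise KeyError(f"reference channel {reference!r} not found")
    reference_intensity = intensities[reference]
    if reference_intensity <= 0:
        raise ValueError("reference intensity must be positive")
    return {
        channel: value / reference_intensity
        for channel, value in intensities.items()
        if channel != reference
    }


# Hypothetical reporter-ion intensities for one peptide in one MS/MS scan.
scan = {"114": 1000.0, "115": 2000.0, "116": 500.0, "117": 1000.0}
ratios = reporter_ratios(scan, reference="114")
```

Because each sample contributes its own reporter ion to the same MS/MS spectrum, a single scan yields the relative abundance of the peptide across all labeled samples.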

1.3.2.2 Proteomics data analysis and interpretation

After data acquisition with any of the methods previously mentioned, the raw signals

obtained (e.g. chromatographic peaks for iTRAQ analyses) have to be translated into proteins and

into biological information. This is a complex process which can be summarized in four main

stages[33]:

1. Introduction

1. DATA PROCESSING: raw data are processed to extract relevant information. After peak

detection, more elaborate algorithms are used to discriminate signal from noise and to derive

more accurate measurements.

2. PEPTIDE IDENTIFICATION: the obtained MS spectra should be compared with the ones collected

in the available databases in order to find matches with known peptides. Several search engines are

available, such as Sequest, Mascot, Comet and X!Tandem. Protein identification via database

searches is computationally intensive and time-demanding. In fact, the assignment of spectra

to peptide sequences is not direct, but it involves matching and scoring large sets of

experimental spectra with predicted masses from fragment ions of peptide sequences. The

various search engines do not yield identical results as they are based on different algorithms

and scoring functions. This makes comparison and integration of results from different studies

extremely challenging[33].

3. PROTEIN IDENTIFICATION AND VALIDATION: in this step, identified peptides are reassembled

in silico into proteins. The association of identified peptides with their precursor proteins is a

very critical procedure since many peptides are common to several proteins, thus making

protein assignments quite ambiguous. For this reason, ad hoc tools like ProteinProphet or

Mascot are used to assess the validity of the protein inference and associate a probability to

it[56]. These tools cleave every protein in the specified search database in silico according to

specific rules depending on the cleavage enzyme used for digestion and then they calculate the

theoretical mass for each peptide. Afterwards, the software computes a score based on the

probability that the peptides from a sample match those in the selected protein database and

derives the ones which better explain the observed peptides. Clearly, proteins with multiple

peptide matches have a much greater confidence in their assignment than proteins identified

only by one peptide[33]. Therefore, the outcome of a proteome analysis is usually a long list of

identified proteins, each associated with a probability score and ideally also with a quantitative value.

4. DATA INTERPRETATION AND INTEGRATION: once the proteomics analysis per se is finished and

the list of relevant proteins is ready, functional analysis is performed with the aim of revealing

pathways and interactions relevant to the biological question of interest. The first step of a

functional analysis is to connect the protein name to a unique identifier and then to its

associated Gene Ontology terms (http://www.geneontology.org). In this way proteins are

matched to their corresponding gene from which it is easier to infer the molecular pathway

involved. As in metabolomics, enrichment and pathway analyses are commonly performed in

order to link the list of significant proteins to the biological context under study. Proteins

involved in chemical reactions and those that have a regulatory influence are collected in so-called

pathway databases, such as KEGG, Reactome, Ingenuity Pathway Knowledge Base or

BioCarta. Almost all pathway databases are equipped with software that can also perform

enrichment analysis, so a single tool is sufficient to extract data on the pathways

involved and on their abundance[58].
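
The in silico cleavage and theoretical mass calculation mentioned in step 3 can be illustrated with a minimal sketch. It assumes the commonly used trypsin rule (cleavage after K or R, except when the next residue is P) and standard monoisotopic residue masses; the function names are hypothetical and the sketch does not reproduce the behaviour of ProteinProphet, Mascot or any other tool.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


# Hypothetical protein fragment used only for illustration.
peptides = tryptic_digest("AGKRPLRGG")
```

Each theoretical peptide mass obtained in this way can then be compared with the measured precursor masses, which is the starting point of the scoring described above.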

1.4 MOTIVATIONS AND OBJECTIVES OF THE STUDY

Despite significant improvements in clinical care, accurate diagnosis and risk stratification

for septic shock patients remain an important challenge. In fact, the early assessment of sepsis

severity is complicated by the highly variable and non-specific symptoms and clinical presentations

of this syndrome. Moreover, the choice of treatment is based only upon the traditional concept of

sepsis progression and corresponding clinical signs (i.e. organ hypoperfusion), so therapies are

not optimized for individual patients[7],[59].

The complex pathophysiology of sepsis suggests that a single biomarker approach cannot

adequately describe and stratify patients affected by this syndrome. Traditional biomarker

strategies, which imply the measurement of the concentration of a panel of circulating proteins

(e.g. C-reactive protein, procalcitonin, cytokines, etc.), have not yielded a definitive set of

biomarkers, since they lacked the sensitivity and specificity to discriminate individual patient

prognoses and outcomes. Therefore, a comprehensive and integrated analysis of molecular

measurements and multiple clinico-pathologic data may facilitate an early and appropriate

therapeutic intervention. Integration of omics and clinical data may thus provide a means to follow

responsiveness to therapy, to establish new therapeutic targets, and finally to enable identification

of patients prone to specific therapies. Overall, early patient stratification may improve septic shock

outcome thanks to a prompt intervention with a tailored therapy[60],[61].

Although the so-called omics technologies have been available for well over a decade,

technological advances in the field are continually increasing the feasibility and accessibility of this

kind of analysis, accompanied by a reduction in costs. In recent years, the interest in metabolomics

has increased since metabolites represent the terminal downstream products of the genome and

consist of the total complement of all low-molecular-weight molecules that cellular processes leave

behind. Since metabolite concentration levels vary as a consequence of genetic, physiological,

pathological or environmental changes, metabolomics studies can be applied in many different

fields and are thus very useful for revealing molecular pathways and for identifying and quantifying

differentially expressed molecules, independently of the multiple trigger factors causing the disease

under investigation. This aspect is very promising for complex and multifactorial syndromes, such

as septic shock, and makes metabolomics analyses a suitable starting point toward personalized

medicine in this area.

Personalized medicine refers to the tailoring of medical treatment to the individual

characteristics of each patient; it is a therapeutic and preventive approach to

disease, which takes into account individual variability in genes, environment and lifestyle[62]. All

these factors cannot be synthesized in a single omics analysis. Only a multilevel approach, which

integrates clinical measurements and different omics data, could elucidate the complex

pathophysiological mechanisms of a disease and may thus provide a more precise picture of the

pathways involved. It is worth underlining that high-throughput omics approaches generate a huge

amount of data, giving rise to very complex and heterogeneous datasets, which cannot be

analyzed using traditional statistical analysis, hence the increasing interest in data mining and

machine learning.

This thesis deals with the exploration of machine learning and data mining techniques for

metabolomics data analysis and multilevel integration in two cohorts of septic shock patients. We

focused on a homogeneous and well-defined group of patients in the same condition (i.e. severe

septic shock) and on a short temporal window (i.e. 48 hours or one week after diagnosis). The first

cohort consists of a subset of 20 patients from the ALBIOS database (Albumin Italian Outcome

Sepsis study, NCT00707122)[63], a multicenter clinical trial which enrolled patients with severe

sepsis or septic shock from 100 ICUs in Italy. The second cohort consists of 21 septic shock

patients from the ShockOmics dataset (NCT02141607)[64].

The primary objective of this thesis is the development of a strategy to analyze and integrate

omics data using data mining approaches in order to identify changes in metabolite patterns

associated with septic shock progression and patient response to treatment. As in other omics studies,

the number of features, e.g. hundreds of metabolite concentrations, is much higher than the

number of observations, i.e. the number of patients. Therefore, the novelty and value of this work

lie in the development of suitable and reliable strategies to cope with this situation: different data

mining techniques, selected ad hoc for the specific scientific question and for the kind of data

considered, were explored.

The models developed made it possible to highlight the molecular pathways involved and to suggest

a list of candidate biomarkers that could inform treatment design so as to lower the mortality risk.

Indeed, the importance of this study is that it could be useful for physicians in septic shock

management and in the design of a therapy tailored to each patient. Moreover, the models

obtained by our data mining approach highlighted some species and pathways, which could help in

understanding the complex mechanisms, currently still under study, involved in the pathogenesis

and progression of septic shock.

1.4.1 Thesis outline

This thesis is organized into six chapters, including two introductory chapters and one of

conclusions, future directions and clinical impact of the work. Three appendices complete the

dissertation, giving further details on metabolomics and proteomics data analyses.

Chapter 1 illustrates the pathophysiological background, with particular emphasis on septic

shock and omics data. Chapter 2 provides an overview of the data mining techniques applied in

our analyses. Chapters 3 and 4 present the analyses performed on a cohort of patients from the ALBIOS

database (Albumin Italian Outcome Sepsis study, NCT00707122)[63]. Chapter 5 describes the

analyses performed on septic shock patients from the ShockOmics clinical trial (NCT02141607)[64].

Chapter 6 summarizes the results, illustrates the clinical impact of the work and outlines the future

steps. The appendices contain additional information about the analytical protocols followed to

perform targeted and untargeted metabolomics analyses, details on proteomics data from multi-iTRAQ

experiments, and an exploratory analysis qualitatively comparing cardiogenic and septic

shock patients. The last appendix reports the complete list of the publications of the PhD candidate.

2 DATA MINING METHODS FOR OMICS DATA ANALYSIS AND INTEGRATION

The main objective of traditional statistical methods is data exploration by considering each

feature as independent from the other variables of the dataset. Therefore, with a statistical test it

is possible to calculate whether confidence in a hypothesis exceeds a significance level, based solely

on a sample-based estimate. They do not allow for a general and valid categorization of selected

variables or, as in our case, for biological and functional interpretation. As an attempt to circumvent

this problem, data mining and alternative feature reduction methods constitute a better strategy

for selecting and prioritizing variables[65]. By considering each feature in relation to all the

others, data mining approaches in fact make it possible to extract previously unsuspected information or

patterns from complex databases, such as the omics ones. It is important to underline that data

mining methods are not meant to replace classical statistical tests, but the two approaches should

be used in a complementary fashion: the former can be considered as hypothesis-generating

methods, while the latter can be used for hypothesis testing[66].

In this chapter, the machine learning techniques employed are reviewed and briefly

described. Firstly, a concise introduction about linear regression methods will be given, followed by

a more detailed explanation of logistic methods for regression, specifically of the elastic net

technique. Afterwards, an overview of the feature reduction methods present in the literature will be

given. Particular emphasis will be placed on the minimum-redundancy maximum-relevance (mRMR)

algorithm which has been applied in this study. Linear classification methods, specifically Linear

Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA), will be

described afterwards. Finally, we will briefly outline probabilistic graphical models.

2.1 THE MULTIPLE TESTING PROBLEM

Omics datasets are characterized by the so-called “curse of dimensionality” problem, which

arises when the number of features p is much higher than the number of observations n. In this

situation, the number of statistical tests to perform increases and, as a consequence, also the

probability of wrongly rejecting the null hypothesis (type I error) increases. In other words, when

setting a p-value threshold of, for example, 0.05, each test for which the null hypothesis is true has a 5% chance of producing a false

positive, i.e. although results are statistically significant there is actually no difference in the group

means. While 5% is acceptable for one test, if many tests are performed on the data, then this 5% can result

in a large number of false positives. This is known as the multiple testing problem. Type I errors are

particularly undesirable in omics studies, as false findings may seriously affect the outcome. There are two

main ways of handling this problem: the Bonferroni and the False Discovery Rate (FDR)

corrections. The Bonferroni correction is the classical approach used for the multiple testing

problem. Instead of setting the critical p-value for significance to 0.05, a lower critical value is used,

obtained by dividing the significance level by the number of comparisons. For instance, if the features are 100

and thus the tests to perform are 100, the critical value for an individual test would be

0.05/100=0.0005, so only features for which the p-value<0.0005 are considered significant. It

is evident that in the case of high-dimensional datasets this condition is too conservative, in the sense

that while it reduces the number of false positives, it also reduces the number of true

discoveries[67].
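
The growth of the type I error with the number of tests, and the corresponding Bonferroni threshold, can be made concrete with a short sketch. It relies on the simplifying assumption that the tests are independent and that all null hypotheses are true, which is rarely the case for correlated omics features; the function names are hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test critical value that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


fwer_100 = family_wise_error_rate(0.05, 100)    # uncorrected, 100 tests
threshold_100 = bonferroni_threshold(0.05, 100)  # the 0.0005 of the example above
fwer_corrected = family_wise_error_rate(threshold_100, 100)
```

Under these assumptions, performing 100 uncorrected tests makes at least one false positive almost certain, whereas the corrected threshold brings the family-wise error rate back below 0.05.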

The FDR approach also determines adjusted p-values for each test, but it controls the

expected proportion of false discoveries among the tests declared significant. This is different from the

Bonferroni correction, which controls the probability of even a single false rejection among all the tests. Because of this, the FDR

approach is less conservative and has greater power to find truly significant results. The FDR

correction calculates a corrected p-value, called q-value, for each tested feature. This q-value is a

function of the p-value of the feature and of the distribution of the entire set of p-values from the family of tests

being considered. For each feature, its associated q-value can be seen as the expected proportion

of false positives incurred when that feature is declared to be significantly different. Hence, a

feature having a q-value of 0.05 implies that 5% of the features showing p-values at least as small as that of such

feature are expected to be false positives. Specifically, a p-value threshold of 0.05 implies that 5% of all tests with a true null hypothesis will result in false

positives, whereas a q-value threshold of 0.05 means that 5% of the significant tests are expected to be false

positives[36], [67]. Thus, when imposing a significance threshold, both the p-value and the FDR

correction should be taken into account.
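
As an illustration of how corrected values can be derived from a set of p-values, the sketch below implements the Benjamini-Hochberg step-up procedure. The choice of this specific procedure is an assumption, since the text does not state which FDR estimator was used, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)
```

In this toy example four features have a p-value below 0.05, but only two of them retain an adjusted value below 0.05, which shows why the raw p-value and the FDR correction should be considered together.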

2.2 LINEAR AND LOGISTIC REGRESSION MODELS

A linear regression model assumes that the regression function y=f(x) is linear in the inputs

x1, … xn, and that y is a continuous variable. The aim of regression models is to identify the function

f, which expresses the relationship between the target variable y, also called dependent variable or

outcome, and n explanatory variables xn, also termed independent variables or predictors. To

achieve a more convincing and sound interpretation, the functional relationship between the

dependent and independent variables, mathematically represented by f, should be of causal nature,

i.e. it should express a cause-effect nexus, with the independent variables xn playing the causal role

and the dependent variable y being the effect.

A linear regression model assumes that, given n input variables x1, …, xn, the response y is

predicted or estimated by the linear function:

$$y = \beta_0 + x_1\beta_1 + \dots + x_n\beta_n \tag{2.1}$$

where $\beta = (\beta_0, \dots, \beta_n)$ are the unknown parameters or coefficients of the model, estimated by the

model fitting procedure. The most popular estimation method is ordinary least squares (OLS),

in which the coefficients are obtained by minimizing the residual sum of squares over the N observations [68]:

$$\beta^{\text{OLS}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{n} x_{ij}\beta_j \Big)^2 \tag{2.2}$$

where each regression coefficient $\beta_j$ represents the change in y given a one-unit change in $x_j$. The input

variables are usually normalized, i.e. centered to have mean 0 and scaled to have standard deviation

1 (i.e. Z-score normalization). In this way the coefficients can be interpreted as weights.
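
The effect of Z-score normalization on the least-squares coefficients can be seen in a minimal sketch with a single predictor. The data are invented, the function names are hypothetical, and the population standard deviation is used for scaling, which is an assumption since the text does not specify the estimator.

```python
from statistics import mean, pstdev


def z_score(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly y = 1 + 2x
beta0, beta1 = ols_simple(x, y)
beta0_z, beta1_z = ols_simple(z_score(x), y)
```

On the raw scale the slope is the change in y per unit of x; after normalization it becomes the change in y per standard deviation of x, and the intercept coincides with the mean of y, which is what allows coefficients of differently scaled inputs to be compared as weights.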

The goal of linear regression models is twofold. On the one hand, they highlight dependency

of the target variable on the predictors, thus enabling a functional and/or causal interpretation and

this was precisely the objective of our metabolomics data analyses. On the other hand, they can

also be used to predict the future value of the target attribute, based upon the functional

relationship identified and upon the future values of the explanatory attributes. Therefore, the

development of a regression model makes it possible to achieve a deeper understanding of the phenomenon

under study and to evaluate the effects determined on the target variable by different combinations

of values assumed by the predictors[69].

In spite of their simplicity, linear models provide an adequate and interpretable description

of how the inputs xn affect the output y. For prediction purposes, they outperform more complex

nonlinear models in some particular situations, e.g. with small numbers of training cases, low signal-

to-noise ratio or sparse data, thus they are broadly used in many different fields[68].

When the dependent variable y is categorical or binomial, for example in a classical

dichotomous problem (sick/healthy, dead/alive, etc.), we can consider f(x) as a reasonable estimate

of the posterior probabilities Pr(G = 1|X = x). However, f(x) can be negative or greater than 1, and

typically some are. These violations in themselves do not guarantee that this approach will not work,

and in fact on many problems it gives similar results to more standard linear methods for

classification. If we allow linear regression onto basis expansions h(x) of the inputs, this approach

can lead to consistent estimates of the probabilities. However, logistic regression is to be

preferred[68].

The logistic regression model arises from the desire to model the posterior probabilities of

the K classes via linear functions in x, while at the same time ensuring that they sum to one and

remain in [0, 1]. The model is specified in terms of (K – 1) log-odds or logit transformations,

reflecting the constraint that the probabilities sum to one. In the case of a dichotomous problem (K=2

classes) the model can be simplified to one equation only:

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + x_1\beta_1 + \dots + x_n\beta_n \tag{2.3}$$

where p is the probability of the presence of the characteristic or class of interest. The logit

transformation is the natural logarithm of the odds, which are defined as:

$$\text{odds} = \frac{p}{1-p} = \frac{\text{probability of the presence of the characteristic}}{\text{probability of the absence of the characteristic}}$$

Because the dependent variable is not a continuous one, the goal of logistic regression is

different from the one of linear regression. In fact, it predicts the likelihood that y is equal to 1 given

certain values of x. That is, if x and y have a positive linear relationship, then the probability that an

observation will have a score of y = 1 will increase as the value of x increases. So, rather than

choosing the parameters that minimize the sum of squared errors (like in ordinary regression),

estimation in logistic regression selects the parameters that maximize the likelihood of observing

the sample values. Because of the use of the logit function, the logistic regression coefficients are

not as easy to interpret; thus they are translated to the so-called odds ratios using the exponential

function. The odds ratios are equal to $e^{\beta}$, where an odds ratio of 1 (i.e. β=0) indicates there is no

relationship between x and y [68].
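
The relation between the logit, the predicted probability and the odds ratios can be checked numerically with the sketch below. The coefficients are hypothetical values chosen so that the second predictor has an odds ratio of exactly 2 and the first one has no effect.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def predicted_probability(beta0, betas, x):
    """Invert the logit: p = 1 / (1 + exp(-(beta0 + sum_j x_j * beta_j)))."""
    linear = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))


def odds_ratios(betas):
    """Odds ratio exp(beta_j) associated with a one-unit increase of each predictor."""
    return [math.exp(b) for b in betas]


# Hypothetical coefficients for two predictors.
beta0, betas = -1.0, [0.0, math.log(2.0)]
p_before = predicted_probability(beta0, betas, [1.0, 1.0])
p_after = predicted_probability(beta0, betas, [1.0, 2.0])  # second predictor + 1
ratios = odds_ratios(betas)
```

Increasing the second predictor by one unit doubles the odds p/(1-p), whereas changing the first predictor, whose coefficient is 0, leaves them unchanged.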

2.2.1 Shrinkage methods

When considering linear regression models, according to the Gauss-Markov Theorem, the

OLS coefficients are the best linear unbiased estimators. This means that, among all linear estimates

with no bias, they have the smallest variance, and thus the smallest mean squared error (MSE).

However, there exist biased estimators with smaller MSE which trade a little bias for a larger

reduction in variance and thus perform better in presence of many correlated variables. In this case

in fact, the $\beta^{\text{OLS}}$ can become poorly determined and exhibit high variance: a large positive

coefficient on one variable can be cancelled by a similarly large negative coefficient on its correlated

counterpart. This issue can be partially mitigated by applying

methods which shrink the regression coefficients by imposing a penalty on their size. For example,

ridge regression[70] is a continuous shrinkage method which minimizes the residual sum of squares

plus a penalty on the squared L2-norm¹ of the regression coefficients:

$$\beta^{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{n} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{n} |\beta_j|^2 \right\} \tag{2.5}$$

where λ is a complexity parameter that controls the amount of shrinkage[68].

Even though ridge regression generally achieves better prediction performance than OLS, it cannot

produce a parsimonious model since it always keeps all the predictors.
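
The continuous shrinkage produced by the ridge penalty can be illustrated in the simplest possible setting, i.e. a single centered predictor and a centered response, for which Eq. 2.5 has the closed-form solution Σxy / (Σx² + λ). The data and the function name are hypothetical.

```python
def ridge_single(x, y, lam):
    """Ridge slope for one centered predictor and a centered response."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]  # exactly y = 2x
slopes = [ridge_single(x, y, lam) for lam in (0.0, 10.0, 90.0)]
```

As λ grows the coefficient moves from the OLS value towards zero without ever reaching it, which is why ridge regression keeps all the predictors in the model.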

To obtain a more easily interpretable model, a technique called the lasso (“least absolute

shrinkage and selection operator”) was proposed by Tibshirani and colleagues[71]. The lasso shrinks

some coefficients and sets others to zero, thus performing at the same time subset selection: only

the variables mainly associated with the outcome obtain a non-null coefficient and thus are included

in the model. The lasso imposes an L1-norm² penalty on the regression coefficients, which are

estimated by:

$$\beta^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{n} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{n} |\beta_j| \right\} \tag{2.6}$$

where the only difference between ridge regression and the lasso is the penalty term (L2 and L1

norm respectively). Although the lasso is successful in many situations, it has two main

limitations[72]:

• the number of variables that the lasso can select is bounded by the number of

observations n in the dataset: in the case $p \gg n$, the lasso selects at most n variables before

it saturates;

• it fails to perform group selection: if there is a group of highly correlated variables, the lasso

tends to select only one variable and ignore the others.

¹ The L2-norm, also called Euclidean norm or distance, is expressed as $\|\beta\|_2 = \sqrt{\beta_1^2 + \dots + \beta_n^2}$

² The L1-norm is defined as $\|\beta\|_1 = |\beta_1| + \dots + |\beta_n|$

These issues make the lasso an inappropriate variable selection method for situations with grouped

variables and when $p \gg n$, as is often the case in datasets containing biological data (e.g. gene

microarray analysis, proteomics or metabolomics).
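
The way the L1 penalty sets coefficients exactly to zero can be seen in the single-predictor case, where the lasso solution is obtained by soft-thresholding. The sketch assumes a centered predictor and response and the objective Σ(y − xb)² + λ|b|, whose minimizer is S(Σxy, λ/2) / Σx²; data and function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma and return 0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_single(x, y, lam):
    """Lasso slope for one centered predictor: minimizes sum((y - x*b)^2) + lam*|b|."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]  # exactly y = 2x
slopes = [lasso_single(x, y, lam) for lam in (0.0, 10.0, 40.0, 50.0)]
```

Unlike the ridge solution, the coefficient becomes exactly zero once λ is large enough, so the corresponding variable is dropped from the model.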

To solve the problems highlighted above, Zou and colleagues proposed a new regularization

technique, called the elastic net[72]. Similarly to the lasso, the elastic net simultaneously performs

automatic variable selection and continuous shrinkage, but it can also select groups of correlated

variables. The elastic net estimator $\beta^{\text{el net}}$ is the minimizer of:

$$\beta^{\text{el net}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{n} x_{ij}\beta_j \Big)^2 + \lambda_2 \sum_{j=1}^{n} |\beta_j|^2 + \lambda_1 \sum_{j=1}^{n} |\beta_j| \right\} \tag{2.7}$$

where the elastic net penalty is a combination of the ridge and the lasso ones. More precisely, the

elastic net selects variables like the lasso, and shrinks together the coefficients of correlated

predictors like ridge regression.

The penalty term for the three models can then be expressed as follows:

$$P_\alpha = \sum_{j=1}^{n} \Big[ \tfrac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j| \Big] \tag{2.8}$$

For $\alpha = 0$ we obtain the ridge penalty, for $\alpha = 1$ the lasso, and for $0 < \alpha < 1$ the elastic net. In this

latter case, the closer $\alpha$ is to 0, the more rigid the model is. The algorithm presented here is the

so-called naïve elastic net, which does not perform satisfactorily unless it is very close to either ridge

regression ($\alpha \approx 0$) or the lasso ($\alpha \approx 1$). In fact, the parameters are penalized twice with the same $\alpha$:

this double shrinkage does not decrease variance and introduces extra bias. This issue can be

overcome by imposing:

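$$P_\alpha = (1-\alpha)|\beta_j| + \alpha\beta_j^2 \quad \text{where} \quad \alpha = \frac{\lambda_2}{\lambda_1+\lambda_2} \tag{2.9}$$

Thus, we obtain the elastic net estimators as:

$$\beta^{\text{el net}} = (1+\lambda_2)\,\beta^{\text{el naive}} \tag{2.10}$$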

The elastic net produces a sparse model with good prediction accuracy, while encouraging a

grouping effect. Empirical results show that the elastic net not only performs well

but also outperforms the lasso, particularly when dealing with several

groups of correlated variables, as in our datasets.

The lasso and elastic net techniques can be used for variable selection and shrinkage with

any linear regression model. For logistic regression, the lasso estimate can be written as:

$$\beta^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \left\{ \frac{1}{N} \operatorname{Dev}(\beta_0, \beta) + \lambda \sum_{j=1}^{n} |\beta_j| \right\} \tag{2.11}$$

where Dev is the deviance of the model fit to the responses using β0 as intercept and β as

predictor coefficients. The larger the deviance of the observed values from the expected ones,

the poorer the fit of the model.

As for the elastic net, i.e. for α strictly between 0 and 1, we obtain:

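$$\beta^{\text{el net}} = \underset{\beta}{\operatorname{argmin}} \left\{ \frac{1}{N} \operatorname{Dev}(\beta_0, \beta) + \lambda P_\alpha(\beta) \right\} \tag{2.12}$$

$$P_\alpha(\beta) = \frac{(1-\alpha)}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_{j=1}^{n} \left( \frac{(1-\alpha)}{2}\beta_j^2 + \alpha|\beta_j| \right) \tag{2.13}$$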

In this work, the elastic net model was implemented in the Matlab® environment with the lasso

and lassoglm functions, available in the Statistics and Machine Learning Toolbox.
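
To give an intuition of the grouping effect, the sketch below fits the penalty of Eq. 2.13 by cyclic coordinate descent on a toy dataset with two identical predictors. It is not the Matlab implementation used in this work: it assumes a Gaussian response (squared error scaled by 1/(2N) instead of the deviance), centered data without intercept and predictors with non-zero variance, and all names and values are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma and return 0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for (1/(2N)) * RSS + lam * P_alpha(beta), no intercept."""
    n_obs, n_feat = len(X), len(X[0])
    beta = [0.0] * n_feat
    for _ in range(n_iter):
        for j in range(n_feat):
            # Correlation of feature j with the residual of the fit without feature j.
            rho = 0.0
            for i in range(n_obs):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(n_feat) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n_obs
            col_ss = sum(X[i][j] ** 2 for i in range(n_obs)) / n_obs
            beta[j] = soft_threshold(rho, lam * alpha) / (col_ss + lam * (1.0 - alpha))
    return beta


# Two identical (perfectly correlated) predictors; the response equals 2 * x.
X = [[-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]
y = [-2.0, 2.0, -2.0, 2.0]
lasso_fit = elastic_net(X, y, lam=0.5, alpha=1.0)
enet_fit = elastic_net(X, y, lam=0.5, alpha=0.5)
```

With α = 1 (lasso) the whole weight is assigned to one of the two identical predictors and the other is discarded, whereas with α = 0.5 the two predictors receive the same coefficient, which is the grouping behaviour described above.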

2.3 FEATURE REDUCTION

The purpose of feature reduction (or feature selection) is to eliminate from the dataset a

subset of uninfluential attributes, which are not suited to accurately explain the investigated

phenomenon. The key concept of feature selection is that correlated variables provide no extra

information about the classes, thereby they constitute noise for the predictor and are possible

source of bias. This implies that the total information content can be obtained only from fewer

unique features, which have maximum discrimination information about the classes[73]. Although

the elastic net or lasso penalty performs feature selection, reducing the number of variables before

building the model can have several advantages. Indeed, not only the computational time of the

learning algorithm decreases, but also the models generated are more robust, accurate and easier

to understand[69]. Moreover, feature reduction is extremely useful when a model is affected by

multicollinearity, as often happens with biological variables. In fact, if we have high collinearity

and a condition where $p \gg n$, the algorithm for estimating the coefficients can fail, the overall

significance of the model is compromised and the estimates of the regression coefficients can be

inaccurate. In the case of linear regression, multicollinearity and the $p \gg n$ condition do not allow

the hat matrix to be computed ($H = X(X^TX)^{-1}X^T$ and $\hat{y} = X\beta = Hy$), as $X^TX$ is not invertible.
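
The singularity of $X^TX$ when there are more features than observations can be verified on a tiny example. The matrix below is hypothetical (two observations of three features) and integer values are used so that the arithmetic is exact.

```python
def gram_matrix(X):
    """Return X^T X for a row-major list-of-lists matrix."""
    n_feat = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(n_feat)]
            for a in range(n_feat)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# n = 2 observations, p = 3 features.
X = [[1, 2, 3],
     [4, 5, 6]]
gram = gram_matrix(X)
determinant = det3(gram)
```

The 3x3 matrix $X^TX$ has rank at most 2, so its determinant is zero and the inverse required by the hat matrix does not exist.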

To remove redundant features, a criterion which measures the relevance of each feature

with the output class must be applied. Several methods have been developed for this purpose and

a brief overview of the main ones is provided in the following.

According to the literature, feature reduction algorithms can be broadly classified into filter,

wrapper and embedded methods. Filter methods are named after the fact that they are applied

before classification in order to “filter out” less relevant attributes. They perform feature selection

by ordering features according to their relevance: a suitable ranking criterion is applied to score the

model variables and a threshold is set in order to remove the variables which fall below it. In this

way, only top ranked features are selected and used for prediction. Feature relevance, i.e. the

usefulness of a feature in discriminating the different classes, can be measured according to several

different criteria, among which the most commonly used are conditional independence, correlation

and mutual information[74]. Filter methods are simple and robust against overfitting, but the

feature subset selected may be neither unique nor optimal[73].
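
A filter method of the kind described above can be sketched as a ranking followed by a threshold. Here the absolute Pearson correlation with the outcome is used as ranking criterion; this choice, the threshold, the feature names and the data are all assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def filter_features(features, outcome, threshold):
    """Rank features by |correlation| with the outcome and keep those above a threshold."""
    scores = {name: abs(pearson(values, outcome)) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= threshold]


# Hypothetical data: three features measured on six patients and a binary outcome.
features = {
    "m1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "m2": [1.0, 3.0, 2.0, 3.0, 2.0, 4.0],
    "m3": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
}
outcome = [0, 0, 0, 1, 1, 1]
selected = filter_features(features, outcome, threshold=0.5)
```

Each feature is scored independently of the others and of any predictor, which makes the procedure fast but blind to redundancy among the selected features.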

Unlike filter methods, in wrapper methods the predictor is “wrapped” in a search

algorithm, which selects the feature subset yielding the highest predictor performance. Since evaluating all

the $2^n$ subsets of features is an NP-hard problem, suboptimal subsets have to be found heuristically.

Two classes of algorithms are used for this purpose: sequential selection and heuristic search

algorithms. Sequential selection algorithms are iterative: they start with an empty set (or full set)

and, at each iteration, they add (or remove) a feature according to its classification accuracy. At

each step, the new subset is evaluated according to the predictor performance and the process is

repeated until the required number of features is reached. Heuristic search algorithms instead

evaluate different feature subsets in order to optimize the predictor performance. Different

feature subsets can be generated either by searching around the feature space or by generating

new solutions, e.g. by randomly selecting without replacement the variables in the dataset.

Although wrapper methods could in principle find the best feature subset, they are prone to

overfitting[73],[74].
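
The sequential forward variant of the wrapper approach can be sketched as a greedy loop around a scoring function. In a real application the score would be the (cross-validated) performance of the predictor trained on the candidate subset; here it is replaced by a toy function with invented values, and all names are hypothetical.

```python
def forward_selection(candidates, score, n_select):
    """Greedy sequential forward selection driven by a user-supplied score function."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_select:
        # Add the feature whose inclusion gives the best-scoring subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for predictor performance (hypothetical values).
GAIN = {"a": 0.60, "b": 0.55, "c": 0.30}


def toy_score(subset):
    total = sum(GAIN[f] for f in subset)
    # "a" and "b" carry overlapping information, so using both adds little.
    if "a" in subset and "b" in subset:
        total -= 0.50
    return total


chosen = forward_selection(["a", "b", "c"], toy_score, n_select=2)
```

Because the subset is evaluated as a whole at each step, the second feature chosen is not the second best individual feature but the one that best complements the first; the price is one model evaluation per candidate at every iteration.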

In embedded methods feature selection is incorporated in the process of building the model,

thus feature reduction and learning are not two separate stages but they continuously interact. As

a result, these methods simultaneously attempt to select features and determine model

parameters[75],[76]. Embedded methods can be further divided into three categories: built-in,

pruning and regularization methods. The first set of models have a built-in mechanism for feature

selection and are represented by classification and regression trees. Pruning methods train a model

with all features, then they attempt to remove some of the features by setting the associated

coefficients to 0, while keeping the model performance. Some examples are support vector machine

and nearest shrunken centroids. The regularization methods aim to minimize the fitting error and

in the meantime to force the coefficients to be small. The coefficients which are close to 0 are then

removed[75].

Every family of feature selection methods (i.e. filter, wrapper and embedded) has its own

advantages and drawbacks, therefore they have to be chosen according to the problem under

investigation and the nature of the data being analyzed. In general, filter methods are fast, since

they do not incorporate learning; wrapper methods are slower than filter, since they evaluate the

model performance at each iteration. Embedded methods tend to have higher capacity than filter

methods and are therefore more likely to overfit. Filter methods usually perform better when the

training set is small, as in our datasets, whereas embedded methods outperform filter methods

when the number of observations is much higher than the number of features[76].

2.3.1 The minimum-redundancy maximum-relevance (mRMR) algorithm

The feature reduction approach adopted in this study was the minimum-redundancy

maximum-relevance (mRMR), a filter method based on mutual information (MI) [77]. As the name

suggests, this algorithm combines two criteria, the maximal relevance (Max-Relevance) and minimal

redundancy (Min-Redundancy). The Max-Relevance selects the features with the highest relevance

to the target, estimated in terms of MI. Given two continuous random variables x and y, their mutual

information $I(x, y)$ is defined in terms of their probability density functions as:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \tag{2.14}$$

MI is zero if x and y are independent, greater than zero if they are dependent. Given a feature subset

S with n features $x_i$, and a target class c, the Max-Relevance criterion consists in searching for features

which satisfy:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \tag{2.15}$$

The features with the highest MI value are the most correlated to the outcome. However, it has

been recognized that the combination of the individually most relevant features does not

necessarily lead to good classification performance; this criterion alone is therefore not enough to

select the optimal feature subset. It is likely that variables selected according to the Max-Relevance

criterion are redundant, i.e. they carry the same information content as the others. Therefore, in spite of

being strongly dependent on the target class, they do not add any meaningful information to the

model and, if one of them is removed, the class-discriminative power of the remaining features

does not change. A criterion to reduce redundancy has thus been introduced to select mutually

exclusive features:

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \tag{2.16}$$

Redundancy is thus expressed as high values of MI among the features. The minimum redundancy

maximum relevance (mRMR) criterion combines the two constraints 2.15 and 2.16 according to:

$$\max \Phi(D, R), \quad \Phi = D - R \tag{2.17}$$

For our analyses we used a two-stage approach. In the first stage, we found a smaller set of

candidate features by applying the mRMR algorithm. Then, we used the selected features to build

the models with the elastic net technique. This allowed us to select a compact set of superior

features and to obtain models with very good accuracies. The source code for mRMR,

developed by Peng[77], are freely available.
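
To make the criterion concrete, the following is a minimal sketch of a greedy mRMR selection, not the implementation by Peng used in this study. It assumes discrete (or previously discretized) features, a plug-in estimate of MI from counts, and the common incremental search in which each new feature maximizes its relevance minus its mean MI with the features already selected. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Plug-in estimate of I(a; b), in nats, for two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), n_ab in count_ab.items():
        # p(a,b) * log(p(a,b) / (p(a) p(b))), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[va] * count_b[vb]))
    return mi


def mrmr(features, target, k):
    """Greedy mRMR: repeatedly add the feature with the best D - R trade-off."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                # mean MI between the candidate and the already selected features
                redundancy = sum(mutual_information(col, features[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: f2 is an exact copy of f1, f3 is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],
    "f3": [0, 0, 1, 0, 1, 1, 0, 1],
}
selected = mrmr(features, target, 2)
```

In this toy case f2 is as relevant as f1 but fully redundant with it, so the second feature chosen is f3 even though its individual MI with the target is lower.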

2.4 LINEAR METHODS FOR CLASSIFICATION

Linear classification methods deal with the problem of finding the best sub-division in the

variable domain so as to assess to which group a sample is most likely to belong. More precisely, in

a classification problem every observation xi in the data matrix X is associated with a qualitative

outcome (or class) yi, which can take only values from a discrete set G. Since the predictor

$\hat{G}(x)$ also takes values in G, the input space can always be divided into a collection of regions labeled

according to the classification. Depending on the prediction function, the boundaries of these

regions can be smooth or rough. For an important class of procedures, called linear methods for

classification, these boundaries are linear[68]. There are several specific algorithms useful for

classification problems; here we briefly describe Linear Discriminant Analysis (LDA) and Partial Least

Squares Discriminant Analysis (PLS-DA), which have been applied in this study. Both methods were

implemented in the Matlab® environment: LDA with the fitcdiscr function, available in the Statistics and

Machine Learning Toolbox, and PLS-DA with the freely available source code developed by Li et

al.[78].

2.4.1 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised classification method based upon the

concept of searching for a linear combination of variables (predictors) that best separates two

classes. The goal of LDA is to project a feature space (i.e. an n-dimensional dataset of n samples) onto

a smaller subspace k (where k ≤ n−1) while maintaining the class-discriminatory information and

avoiding overfitting by minimizing the error in parameter estimation. If we consider a two-class

classification problem, we can define separability by means of a score function S:

$$S(\beta) = \frac{\beta^\top \mu_1 - \beta^\top \mu_2}{\beta^\top C \beta} \tag{2.18}$$

where $\beta$ are the coefficients of the linear model and $\mu_1$, $\mu_2$ are the mean vectors of the two classes. Given the score

function S, the problem is to estimate the linear coefficients which maximize the score and it can be

solved as follows:

$$\beta = C^{-1}(\mu_1 - \mu_2) \tag{2.19}$$

with C being the pooled covariance matrix, expressed as:

$$C = \frac{1}{n_1 + n_2}\left(n_1 C_1 + n_2 C_2\right) \tag{2.20}$$

where $C_1$, $C_2$ are the covariance matrices and $n_1$, $n_2$ are the numbers of samples in the two classes.

To assess the effectiveness of the discrimination, one way is to calculate the Mahalanobis distance

$\Delta$ between the two groups:

$$\Delta^2 = \beta^\top(\mu_1 - \mu_2) \tag{2.21}$$

where a distance greater than 3 means that the two averages differ by more than 3 standard

deviations. This implies that the overlap (i.e. the probability of misclassification) is quite small.
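
As an illustration of equations 2.19 to 2.21, the sketch below computes the LDA coefficients and the squared Mahalanobis distance for a two-variable, two-class toy problem. It is not the Matlab fitcdiscr implementation used in this study: the helper names and the data are hypothetical, the class covariances are normalized by 1/n (an assumption), and the linear system is solved by Cramer's rule, which only covers the 2x2 case.

```python
def mean_vector(rows):
    """Column-wise mean of a list of samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance(rows, mu):
    """Covariance matrix with 1/n normalization (assumption; 1/(n-1) is also common)."""
    n, p = len(rows), len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
             for j in range(p)] for i in range(p)]


def solve_2x2(a, b):
    """Solve a x = b for a 2x2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def lda_two_class(class1, class2):
    """Return the coefficients beta (eq. 2.19) and Delta^2 (eq. 2.21)."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = mean_vector(class1), mean_vector(class2)
    c1, c2 = covariance(class1, mu1), covariance(class2, mu2)
    # pooled covariance matrix, eq. 2.20
    pooled = [[(n1 * c1[i][j] + n2 * c2[i][j]) / (n1 + n2) for j in range(2)]
              for i in range(2)]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    beta = solve_2x2(pooled, diff)
    delta_sq = beta[0] * diff[0] + beta[1] * diff[1]
    return beta, delta_sq


# Hypothetical toy data: two groups with identity covariance, centers 4 units apart.
class1 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
class2 = [[4.0, 0.0], [6.0, 0.0], [4.0, 2.0], [6.0, 2.0]]
beta, delta_sq = lda_two_class(class1, class2)
```

For these data the square root of delta_sq is larger than 3, so by the rule of thumb above the two groups would be considered well separated.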

LDA has two main limitations. First, the number of variables needs to be less than the

number of observations; hence, when the number of variables exceeds the number of samples,

feature reduction has to be performed before applying this algorithm. The other disadvantage is

that LDA does not take into account the differing variance structure of each group since it only uses

a single pooled covariance matrix C. This approach is not appropriate for a dispersed group, for

which a relatively large distance from a mean may be less significant than for a compact one.

We want to remark that with two classes, there is a simple correspondence between linear

discriminant analysis and classification by linear least squares. In fact, if we code the

targets in the two classes as +1 and −1 respectively, it is easy to show that the coefficient vector

from least squares is proportional to the LDA direction[79].

2.4.2 Partial Least Squares Discriminant Analysis (PLS-DA)

Partial least squares discriminant analysis (PLS-DA) is a supervised method mainly used for

classification purposes, i.e. to determine to which group a sample is most likely to belong, given

a set of measurements. In the two-dimensional case, the PLS-DA algorithm can be regarded as a

linear two-class classifier which aims to find a straight line that divides the space into two regions.

When there are more than two variables, the decision function is represented by a hyperplane in

the multidimensional space. To optimize the separation between groups, PLS-DA links the two data

matrices X (raw data) and Y (class membership) by maximizing the covariance between them. The

original variables are thus summarized into fewer new ones (called PLS components or latent

variables) with the constraint of explaining as much as possible of the covariance between X and Y [80].

The PLS-DA algorithm is derived from PLS regression and involves building a regression

model between X and Y by decomposing the two matrices into a product of a common set of specific

loadings. The fundamental PLS-DA equations are the following[79]:

$$X = TP + E \tag{2.22}$$

$$Y = Tq + f \tag{2.23}$$

where T is the common score matrix, i.e. it contains the projections of X and Y onto the hyperplane

representing the maximum covariance between them. P and q are the loading matrices, which

contain the directions of the hyperplane with respect to the X and Y variables. Thereby, the loading

matrices give information about how each variable influences the model. Finally, E and f are the

residuals, i.e. the errors of X and Y left unaccounted for by the model. The PLS-DA algorithm works as

follows:

1. Calculate the PLS weight vector w

$$w = X^\top Y \tag{2.24}$$

2. Calculate the scores, given by

$$T = \frac{Xw}{\sqrt{\sum w^2}} \tag{2.25}$$

3. Calculate the X loadings by

$$P = \frac{T^\top X}{\sum T^2} \tag{2.26}$$

4. Calculate the Y loadings (a scalar)

$$q = \frac{Y^\top T}{\sum T^2} \tag{2.27}$$

5. Subtract the effect of the new PLS component from the data matrix to obtain a residual data

matrix

$$E = X - TP \tag{2.28}$$

$$f = Y - Tq \tag{2.29}$$

Scores and loadings are constructed so that the first score of X has the maximum

covariance with the first one of Y and so on. Variable importance in a PLS-DA model can be measured

by the Variable Importance in Projection (VIP) score, which expresses the contribution of each

variable in the model. The VIP score of a variable is calculated as a weighted sum of the squared

correlations between the PLS-DA components and the original variable and the weights correspond

to the percentage of variation explained by the PLS-DA component in the model. Once a model is

built, it is possible to predict the value of Y both for the original data and for new samples.
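
The five steps of the algorithm can be sketched for a single PLS component as follows. This is an illustrative re-implementation in plain Python, not the code by Li et al. used in this study; it assumes that X and Y have already been centered, that Y is coded as -1/+1 for the two classes, and that a sample is assigned to a class according to the sign of the fitted value. The function name and the toy data are hypothetical.

```python
import math


def pls_component(X, y):
    """Extract one PLS component from centered X (list of rows) and centered y."""
    n, p = len(X), len(X[0])
    # step 1 (eq. 2.24): weight vector w = X^T y
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # step 2 (eq. 2.25): scores t = X w / sqrt(sum w^2)
    norm_w = math.sqrt(sum(v * v for v in w))
    t = [sum(X[i][j] * w[j] for j in range(p)) / norm_w for i in range(n)]
    tt = sum(v * v for v in t)
    # step 3 (eq. 2.26): X loadings
    loadings = [sum(t[i] * X[i][j] for i in range(n)) / tt for j in range(p)]
    # step 4 (eq. 2.27): Y loading (a scalar)
    q = sum(y[i] * t[i] for i in range(n)) / tt
    # step 5 (eqs. 2.28 and 2.29): residuals after removing the component
    E = [[X[i][j] - t[i] * loadings[j] for j in range(p)] for i in range(n)]
    f = [y[i] - t[i] * q for i in range(n)]
    return w, t, loadings, q, E, f


# Hypothetical centered toy data: 4 samples, 2 variables, classes coded -1/+1.
X = [[-2.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [2.0, 1.0]]
y = [-1.0, -1.0, 1.0, 1.0]
w, t, loadings, q, E, f = pls_component(X, y)

# Fitted values of y from the first component; the sign gives the predicted class.
y_fitted = [ti * q for ti in t]
predicted = [1.0 if v > 0 else -1.0 for v in y_fitted]
```

Further components would be obtained by repeating the same steps on the residuals E and f.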

It is worth underlining that when the group sizes are unequal, as in our case, the

decision boundary is shifted towards the larger group and consequently many samples can be

misclassified. The solution is to weight the centering of X by subtracting the average of the

means of the two groups, $(\bar{X}_A + \bar{X}_B)/2$, from the columns so as to shift the decision boundary away

from the larger group.

The PLS-DA approach is widely used in omics data analysis since it can handle highly collinear

and noisy data, which are very common outputs of this kind of studies. In addition, it also provides

loading weights and VIP scores, which can be used to identify the most important variables. Results

are shown in low dimensional score plots which illustrate the separation between groups in an easily

interpretable way. Moreover, comparing the loadings and score plots makes it possible to investigate

the relationships among important variables that may be specific to the group of interest. This aspect is

fundamental in fields such as metabolomics, in which we are interested not only in deciding to which

group a sample belongs, but also in assessing which variables are the best discriminators.

In spite of these advantages, the PLS-DA algorithm has the tendency to over-fit, especially

when the number of variables significantly exceeds the number of observations. This is due to the

fact that some correlations can be found just by chance, given the high number of variables. A way

to avoid overfitting is to split samples into training and test sets or, as an alternative when the

number of samples is too low, to perform variable reduction[79],[81].

2.5 PERFORMANCE EVALUATION

Within a classification analysis it is usually advisable to validate the model on an independent

dataset. To do so, the dataset is divided into two smaller subsets: the training and testing sets. As

the names suggest, the training set is used for training the classification model, that is for deriving

the functional relationship between the target variable and the explanatory variables. What remains

of the available data, i.e. the testing set, is used later to evaluate the performance of the generated

model or to select the best model out of those developed using alternative classification methods.

To guarantee that each observation appears the same number of times in the training and in the

testing set, the cross-validation (CV) method can be applied to partition the dataset. The dataset is

divided into k subsets and, at each time, one of the k subsets is used as the test set and the other k-

1 subsets are used as a training set. The procedure is repeated k times using each of the k training

sets in turn and evaluating the model performance each time on the corresponding test set. At the

end of the procedure, the overall accuracy is computed as the average of the k individual

performances. To reduce variability, multiple rounds of cross-validation are performed using

different partitions (different k), and the validation results are averaged over the rounds[69].
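
The partitioning scheme described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this study: the function names are hypothetical, the samples are shuffled with a fixed seed for reproducibility, and `evaluate` stands for any user-supplied function that trains a model on the training indices and returns its performance on the test indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k folds; return a list of (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(n_samples, k, evaluate, seed=0):
    """Average the performance returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Example: 20 samples and 5 folds, so every sample is tested exactly once.
splits = k_fold_indices(20, 5)
```

Repeating the call with different seeds gives the multiple rounds of cross-validation whose results are then averaged.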

In a binary classification as in our case, i.e. when we have two classes only, for each instance

in the test set, a model prediction is expressed as the probability of being in the case class with a

value in the range [0,1]. Two strategies were used for performance evaluation: confusion matrices

and Receiver Operating Characteristic (ROC) curves.

In a confusion matrix, the columns are the predicted class and the rows are the actual class.

The performance of a model is evaluated by means of the accuracy, i.e. the proportion of correct

classifications (true negatives, TN, and true positives, TP) among the total number of cases examined.

When data are imbalanced, the predictive accuracy derived from a confusion matrix may not be

appropriate, thus ROC curves may be used instead. A ROC curve is a standard technique for

summarizing classifier performance over a range of trade-offs between the true positive (TP) rate, i.e. the proportion of positive

examples correctly classified, and the false positive (FP) rate, i.e. the proportion of negative examples incorrectly

classified as positive. An accepted performance metric for a ROC curve is the Area Under

the Curve (AUC), as it is independent of the decision criterion selected and of the prior probabilities[82].
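
The two evaluation strategies can be illustrated with a short sketch. The function names, the 0.5 threshold and the scores below are hypothetical. The AUC is computed here through its equivalent rank interpretation, i.e. the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as one half), rather than by integrating the ROC curve.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (TP, FP, TN, FN) for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Proportion of correct classifications among all cases."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs that are ranked correctly."""
    pos = [s for label, s in zip(y_true, y_score) if label == 1]
    neg = [s for label, s in zip(y_true, y_score) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced example: 3 cases and 5 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
counts = confusion_counts(y_true, y_score)
acc = accuracy(*counts)
area = auc(y_true, y_score)
```

Unlike the accuracy, the value returned by `auc` does not depend on the threshold chosen to assign the classes.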

2.6 PROBABILISTIC GRAPHICAL MODELS

The framework of probabilistic graphical models is quite broad and encompasses a variety

of different types of models and of methods related to them. The aim of probabilistic graphical

models is to capture the underlying probabilistic relations between variables of interest and to

express the underlying set of conditional independence (CI) assumptions via a graph structure.

Let P(x) be a joint probability distribution of n discrete variables (x1, …, xn); the aim is to find

an approximation of this distribution by taking advantage of the properties of statistical

independence, i.e. P(xi|xj) = P(xi) if xi and xj are statistically independent.

The basic idea is that variables are statistically dependent only on very few others. These

interactions can be represented as a network, thus providing an easily interpretable picture of the

complex statistical relationships that exist among the domain variables[83], [84]. Graphical models

are used in many different fields of knowledge for diagnosis, prediction, classification and decision

making. They have also found a wide application in medicine and biology since they can be used to

predict and model the molecular basis of complex disease states.

An example of a simple graphical model is shown in Figure 2.1. Variables in the domain are

modeled as random variables and represented as nodes; the edges connecting them represent a

conditional dependence of the child node on the parent node. Thus, the absence of an edge

connecting two nodes implies that the two corresponding variables are conditionally independent,

given the other variables. In addition to the graph structure, each node is annotated with the

conditional distribution of the variable given the values of its parents, and this information can be

used to infer the most probable values of variables in the network, given assignments to other

variables[84].

Figure 2.1 - Example of a simple graphical model. The nodes represent the variables A, B, C and D, whereas the edges denote the conditional dependence among them (e.g. nodes A and B are conditionally independent, given the other variables).

There are two main families of graphical representations: the Bayesian Networks (BN), which

use directed graphs (i.e. the edges are directed since they have a source and a target), and the

Markov Networks (MN), also called Markov Random Fields, which use undirected graphs[83]. For

both methods, there are two main steps to perform an analysis: structure learning and inference.

Structure learning consists in building the network, whereas inference is the process of computing

the consequences of the network for outcome prediction[84].

The graphical structure of a BN is a Directed Acyclic Graph (DAG), in which a conditional

probability distribution is associated with each node $N_i$ and described as

$$P(N_i \mid Pa(N_i)) \tag{2.30}$$

This implies that the node $N_i$ is conditionally independent of its non-descendants, given its

immediate parents $Pa(N_i)$. The joint distribution of the nodes is given by:

$$P(N_1, \ldots, N_n) = \prod_{i} P(N_i \mid Pa(N_i)) \tag{2.31}$$

Figure 2.2 illustrates an example of a directed graph, which represents the following joint

probability function P(G, S, R)=P(G|S,R)P(S|R)P(R). The nodes are connected by arrows to

depict the dependence relations graphically.

Figure 2.2 – Example of a DAG. Note that the edges are directed.
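
The factorization P(G, S, R)=P(G|S,R)P(S|R)P(R) and its use for inference can be illustrated with a short sketch. All the probability values below are hypothetical and only serve the example; the variables are assumed to be binary. The joint distribution is obtained as the product of the conditional distributions attached to the nodes, and a posterior such as P(R | G) is obtained by summing the joint over the unobserved variable.

```python
from itertools import product

# Hypothetical conditional probability tables for three binary variables.
p_r = {True: 0.2, False: 0.8}                      # P(R)
p_s_given_r = {True: {True: 0.01, False: 0.99},    # P(S | R), indexed as [r][s]
               False: {True: 0.4, False: 0.6}}
p_g_true_given_sr = {(True, True): 0.99,           # P(G = True | S, R), keyed by (s, r)
                     (True, False): 0.9,
                     (False, True): 0.8,
                     (False, False): 0.0}


def joint(g, s, r):
    """P(G, S, R) = P(G | S, R) P(S | R) P(R)."""
    p_g_true = p_g_true_given_sr[(s, r)]
    p_g = p_g_true if g else 1.0 - p_g_true
    return p_g * p_s_given_r[r][s] * p_r[r]


# The joint distribution sums to one over all eight configurations.
total = sum(joint(g, s, r) for g, s, r in product([True, False], repeat=3))

# Inference by enumeration: P(R = True | G = True), marginalizing over S.
numerator = sum(joint(True, s, True) for s in (True, False))
denominator = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
posterior = numerator / denominator
```

Enumeration is only feasible for very small networks; it is shown here because it follows the factorization directly.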

If no a priori hypothesis or mechanistic model is available, the correspondence

between the CI of the random variables and the representation of corresponding nodes on the

graph has to be modeled by specific structure learning algorithms which can be grouped in three

main categories[85]:

• Constraint based algorithms: they find the network that best explains dependencies and

independencies in the data. They use CI tests to detect the Markov blankets of the variables, i.e.

the set of nodes that includes the parents, the children and all the nodes that share a child with that

particular node. The Markov blankets are then used to build the network structure. These algorithms consist of

three steps[85],[86]:

1. Building of the skeleton of the network (i.e. the undirected graph underlying the network

structure). Since an exhaustive search is computationally unfeasible, all learning algorithms

use some kind of optimization such as restricting the search to the Markov blanket of each node;

2. Set the directions of the edges by considering each triplet of nodes;

3. Set the directions of the other arcs as needed to satisfy the acyclicity constraint.

• Score-based algorithms: they find the highest-scoring network structure by assigning a

goodness-of-fit score to each candidate BN and trying to maximize it with heuristic search

algorithms (e.g. hill climbing, tabu search). During the exploration process, the scoring function

is applied in order to evaluate the fitness of each candidate structure to the data[85], [87].

Popular network scores include the log-likelihood score, the Akaike information criterion (AIC)

and the Bayesian Information Criterion (BIC) which are all based on the maximization of the

value of the likelihood function associated to each model;

• Hybrid approaches: they integrate constraint and score-based algorithms since they use

conditional independence tests and network scores at the same time[85],[86].
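
As an illustration of how a score-based algorithm compares candidate structures, the sketch below computes a BIC score for a discrete network from data. It is not the bnlearn implementation used in this study: the function name and the toy data are hypothetical, the BIC is taken as the maximized log-likelihood minus (log n)/2 times the number of free parameters, and the number of levels of each variable is inferred from the data.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a discrete network.

    data: list of dicts mapping variable name -> observed value.
    parents: dict mapping each variable to a tuple with the names of its parents.
    """
    n = len(data)
    levels = {v: len({row[v] for row in data}) for v in parents}
    score = 0.0
    for var, pa in parents.items():
        # counts of (parent configuration, value) and of parent configuration alone
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        loglik = sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
        n_params = (levels[var] - 1) * math.prod(levels[p] for p in pa)
        score += loglik - 0.5 * math.log(n) * n_params
    return score


# Hypothetical data: A and B agree in 16 samples out of 20.
data = ([{"A": 0, "B": 0}] * 8 + [{"A": 1, "B": 1}] * 8 +
        [{"A": 0, "B": 1}] * 2 + [{"A": 1, "B": 0}] * 2)

score_empty = bic_score(data, {"A": (), "B": ()})      # no edge
score_edge = bic_score(data, {"A": (), "B": ("A",)})   # A -> B
```

A search procedure such as hill climbing would repeatedly add, remove or reverse single edges and keep the change whenever the score increases; here the structure with the edge scores higher than the empty one because the gain in likelihood outweighs the penalty for the extra parameter.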

Structure learning algorithms are driven by distinct principles and metrics, so the resulting

models may be different. The algorithms based on independence tests perform a qualitative study

of the dependence and independence relationships between the variables in the domain, thus they

attempt to find a network that represents these relationships as far as possible.

Score-based algorithms instead attempt to find a graph that maximizes the selected score.

In other words, the score represents a measure of closeness in approximating P(x) by a combination of

lower order distributions. Each algorithm is characterized by a specific scoring function and search

procedure used, thereby results may differ according to the methods applied[87]. For this reason,

it is recommended to use an algorithm of each category and to compare the results. Ideally, the

edges which appear in all models should represent the strongest dependencies.

Being forced to choose the direction of edges, BNs are not suitable for some domains (e.g.

spatial or relational data). In such cases, MN can be applied. Also for MN, the relationship among

variables (i.e. the nodes $N_i$) is based on CI but it is defined by the Markov property, expressed by:

$$N_1 \perp N_2 \mid N_i \tag{2.32}$$

where ⊥ indicates independence in the joint distribution over the domain. Thus two nodes are

conditionally independent if there is no edge between them. Since directionality of the edges is not

considered, CI alone can be used to build a network[88]. In spite of this, to obtain more robust

networks, it is advisable to apply the same algorithms already described for BN structure learning.

One of the main differences between BNs and MNs is that MNs may be cyclic, therefore they

can represent cyclic dependencies which a BN cannot. Given their directionality, BNs can instead

represent induced dependencies and this makes it possible to represent causality: an edge from A to B

indicates in fact that “variable A causes B” and this information can thus be used for inference[83].

Probabilistic graphical models are an attractive methodology in the omics field, since they

can model complex interactions between many variables of interest in an easily interpretable way.

Furthermore, they provide a robust analytic approach for identifying both predictors of between-

individual variation within a group of interest and other potentially interesting interactions between

physiological, pathological and environmental variables among different groups (e.g. survivors vs

non survivors). Probabilistic graphical models, and above all BNs, have already been applied in some

genomics and metabolomics studies to identify novel biomarkers, to reconstruct pathways of

interest and for prediction or classification purposes[66],[89],[90]. In this study, probabilistic

graphical models have been used for explorative analyses both on metabolomics data and on

integrated proteomics and metabolomics data sets. The models have been built using the R

packages gRapHD and bnlearn[85],[91].

2.7 FINAL CONSIDERATIONS

We are aware that the methods here presented do not cover all the possible strategies

available for omics data analysis and integration. In fact, as reported by Gromski et al.[81], several

methods, suitable for this purpose, can be found in the literature and are currently used to analyze

large and highly complex datasets. In that review, two other approaches are suggested: random

forest (RF) and support vector machines (SVM), which will be briefly described in this paragraph.

The random forests technique is an ensemble learning method that generates many

classification trees and aggregates them to compute a classification. In a decision tree, the instances

within each node are split into subgroups using, among all features available, the one that

maximizes a given criterion. The criterion, usually the Gini Impurity Index, evaluates the

homogeneity of new nodes locally. Random forests introduce variations in the samples and

instability over the single classification trees by drawing several bootstrap samples from the original

training data. Each bootstrap sample is used to fit a single classification tree, and, at each split in the

tree, the algorithm randomly restricts the set of predictor variables to select from. The ensemble

will then consist of a diverse set of trees. For prediction, an average over the predictions of the

single trees is used, as it has been shown to be more accurate than any of the single trees[92].

The main issue related to RF is that it is very difficult to interpret them in terms of underlying

mechanisms leading to the obtained classification. It is indeed critical to understand which variables

or interactions between variables are providing the predictive accuracy. The use of internal out-of-

bag (OOB) estimates was proposed by Breiman [92] to estimate the importance of each variable in

the model. Each tree is grown on a number of samples randomly selected with replacement.

Because of the replacement, a subset of samples is not included in the

building process of each tree: it is the out-of-bag sample set of that tree. To assess the role of each

variable in the prediction performance of the forest, the OOB samples of all the trees can be used. The procedure

proposed by Breiman uses the samples not included in the building process of each tree of the forest to

test the tree performance, and then tests it again on the same sample subset after randomly permuting the values of the variable of interest. The

comparison of the results makes it possible to estimate an index, which is related to the importance of that

variable. If the performance of the forest does not change when that variable is permuted, it means that

the considered variable has low importance in the forest classification performance.
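
The permutation idea can be sketched in a simplified form. The code below is not Breiman's per-tree OOB procedure: it assumes a single set of held-out samples standing in for the OOB samples and a fixed, hypothetical classifier standing in for the forest. The importance of a variable is estimated as the mean drop in accuracy observed when its values are randomly permuted among the samples.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of samples for which the classifier returns the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, repeats=20, seed=0):
    """Mean decrease in accuracy after permuting one variable across samples."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # rebuild the samples with the permuted values of the chosen variable
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


def predict(row):
    """Hypothetical stand-in classifier: uses variable 0 only, ignores variable 1."""
    return 1 if row[0] > 0.5 else 0


# Hypothetical held-out samples (standing in for the OOB samples).
rows = [[0.1, 5], [0.2, 3], [0.3, 8], [0.4, 1], [0.6, 7], [0.7, 2], [0.8, 9], [0.9, 4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

importance_used = permutation_importance(predict, rows, labels, feature=0)
importance_ignored = permutation_importance(predict, rows, labels, feature=1)
```

Since the stand-in classifier never looks at variable 1, permuting it leaves the accuracy unchanged and its estimated importance is zero, whereas permuting variable 0 degrades the predictions.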

Support Vector Machine (SVM) instead performs classification by constructing separating

planes which distinguish between objects of different class memberships in a multidimensional

space[93]. Given labeled training data, the algorithm aims at finding the discriminating hyperplane

that maintains an optimal margin from the boundary of the training dataset. The training points closest to this

hyperplane are called support vectors, and the hyperplane is used to categorize new instances. The optimal hyperplane, i.e. the

one that achieves the maximum geometric margin, is obtained by an iterative training algorithm by

finding the solution for a quadratic optimization problem.

Neither RF nor SVM performs variable selection, unlike the lasso (least absolute shrinkage

and selection operator) regression analysis or other shrinkage and regularization approaches. In this

case a wrapper strategy should be adopted. As already outlined, wrapper methods evaluate subsets

of variables, which makes it possible to detect the possible interactions between variables. The variables are

removed according to the variable ranking (variable importance in RF or by applying mRMR). The

selected subset is the most parsimonious one among those with the best performance (one-standard

error rule). The two main disadvantages of wrapper techniques are: the increasing overfitting risk

when the number of observations is insufficient and the significant computation time when the

number of variables is large.

To conclude, since metabolomics is a relatively young and complex field, to date there is no

universal choice of an analytical method which is superior in all cases. For this reason, only the

application of a variety of different approaches, suited to the data under analysis, may lead to robust

results.

3 MORTALITY PREDICTION FOR SEVERE SEPTIC SHOCK PATIENTS: A

TARGETED METABOLOMICS STUDY ON ALBIOS DATABASE

In this study we examined plasma metabolome and clinical features in a subset of 20 patients

with severe septic shock (SOFA score >8), enrolled in the multicenter Albumin Italian Outcome

Sepsis study (ALBIOS, NCT00707122). This work was partly presented at the XVI Congress of the

European Shock Society (ESS)1, and published as a journal paper in Scientific Reports2. Our purpose

was to identify metabolite changes associated with 28-day mortality and to elucidate early

biomarker signatures, which might help clinicians in prioritizing individual patient treatment during

shock. Blood samples were analyzed at the laboratory of Mass Spectrometry at IRCCS Mario Negri

Institute in Milan under the supervision of Dr. Roberta Pastorelli.

A mass spectrometry-based quantitative metabolomics approach was used to

simultaneously measure different metabolite classes, including acylcarnitines, amino acids,

biogenic amines, glycerophospholipids, sphingolipids, and sugars. The elastic net technique was

applied to find association with mortality on the basis of metabolite concentration levels and clinical

parameters. Our results showed that low unsaturated long-chain phosphatidylcholines species and

lysophosphatidylcholines species were associated with survival together with circulating

kynurenine. Moreover, these glycerophospholipids were negatively correlated to the event in

combination with clinical variables such as cardiovascular SOFA score. Overall, we observed that

early changes in plasma levels of both lipid species and kynurenine are associated with mortality

and this may have potential implications for early intervention and for the discovery of new targeted

therapies.

In the following we will present the rationale behind the study, the dataset used and the

methods applied. A discussion of the results will close the chapter.

1 A. CAMBIAGHI, L. Brunelli, P. Caironi, et al., “SCK-3: Target metabolomics for improving early prediction of death in patients with septic shock”, XVI. Congress of the EUROPEAN SHOCK SOCIETY, Cologne, Germany, September 24-26 2015 (abstract published in Shock Journal, 2015, 44(2) - pp: 1-27)

2 M. Ferrario, A. CAMBIAGHI, L. Brunelli, S. Giordano, P. Caironi, L. Guatteri, F. Raimondi, L. Gattinoni, R. Latini, S. Masson, G. Ristagno, R. Pastorelli, “Mortality prediction in patients with severe septic shock: a pilot study using a target metabolomics approach”, Scientific Reports, 6 (2016)

3.1 INTRODUCTION

ALBIOS (Albumin Italian Outcome Sepsis study, NCT00707122) is a recent, large, multicenter,

randomized clinical trial that enrolled 1818 patients with severe sepsis or septic shock. Given the

high mortality rate (46.7%)[63], similar to that observed in other comparable clinical studies[94],

[95], we believed that a multimarker strategy may be helpful to better understand the complex

pathogenesis of the disease and its evolution for early risk stratification and personalized therapies

implementation. Several recent studies have focused on investigating plasma metabolomics profiles

as predictive signatures of ICU mortality in adult patients[89], [96]–[98], thus the use of emerging

omics tools capable of examining physiological responses at the system level is particularly promising for

complex conditions such as septic shock.

In previous studies, different composite metabolite patterns have been identified with

nuclear magnetic resonance (NMR) or mass spectrometry (MS). Although these methods have

different intrinsic metabolomics coverage potential, they all clearly highlight the widespread

metabolic abnormalities in patients with septic shock, and the interplay of several different

biochemical pathways.

In the present study, we used a target mass spectrometry-based quantitative metabolomics

approach focusing our attention on several series of metabolites, some of which have already been

identified as part of key biochemical pathways in septic shock. More precisely, the metabolite

species mainly involved are kynurenine and lysophosphatidylcholines, alterations of which

have already been reported in septic shock patients[89],[94]-[98]. We applied this strategy to a

selected subset of patients with severe septic shock (SOFA>8 and lactate level >4 mmol/L) enrolled

in the ALBIOS study.

Our explorative study was designed to provide absolute quantitative information on changes

in plasma metabolite levels measured one day (initial acute phase) and one week after development

of severe septic shock, and to relate these changes with mortality. The two time points were chosen

to verify the hypothesis that the metabolic changes over the time period reflect not only initial

clinical characteristics, but also the progression of the disease and long-term survival. Association

between metabolic patterns and mortality was assessed with univariate and multivariate analyses

adjusted for clinically relevant variables.

The primary goal of this pilot investigation was to verify the feasibility of our metabolomics

approach, which is intended to be used in the ShockOmics clinical trial (NCT02141607), a study aimed at

elucidating early multilevel marker signatures which could reveal the metabolic pathways involved

in this syndrome as a necessary step for a targeted therapy.

3.2 MATERIAL AND METHODS

3.2.1 Study design, patients and clinical data

The multicenter ALBIOS clinical trial enrolled patients with severe sepsis or septic shock from

100 ICUs in Italy (NCT00707122), as fully described in the original article[63]. We analyzed only a

subset of patients, selected according to the following inclusion criteria: the presence of septic shock

(i.e. presence of a proved or suspected infection in at least one site; two or more signs of systemic

inflammatory response syndrome; the presence of an acute sepsis-related cardiovascular

dysfunction or systolic blood pressure < 90 mmHg), total SOFA score > 8, serum lactate > 4 mmol/L,

and availability of plasma samples at day 1 (acute state, D1) and day 7 (steady state, D7) after

diagnosis of septic shock. Exclusion criteria included the presence of active hematological

malignancy or cancer, immunodepression, HIV, chronic renal failure, or cirrhosis. Only patients

discharged from the ICU between 7 and 14 days after the occurrence of shock were considered. Such

inclusion and exclusion criteria were chosen in accordance with those of the multicenter clinical

study, ShockOmics (NCT02141607), as the current study represents a preliminary investigation.

Only 20 of the 1818 patients enrolled in the ALBIOS trial and with plasma samples stored in

the biobank fulfilled the inclusion criteria. These patients were analyzed according to their survival

status 28 days after study enrollment and were thus classified into two groups: survivors (11

patients, S) and non-survivors (9 patients, NS). For each time point at which the blood samples were

collected (i.e. D1 and D7), we considered 24 clinical parameters and 137 metabolite concentrations

(µM).

3.2.2 Univariate analyses for metabolomics data

A targeted quantitative approach using a combined direct flow injection and liquid

chromatography (LC) tandem mass spectrometry (MS/MS) assay was applied for targeted

metabolomics analysis. Details on the protocol are illustrated in Appendix A. The changes in

metabolite concentrations from D1 to D7 within the same group were evaluated by means of the

paired Wilcoxon signed rank test. The comparisons between S and NS patients were performed by

unpaired Wilcoxon rank-sum test both at D1 and at D7. Finally, for each metabolite, the time-trend

variation in metabolite concentration (i.e. ∆=D7-D1) was compared between the two groups by

Wilcoxon rank-sum test. To overcome the problem of the large number of statistical comparisons,

the false discovery rate (FDR) was computed using a bootstrapping technique of oversampling with

replacement to obtain a sample size of 20 patients per group for a total of 40 observations. Results

were considered statistically significant when p-value <0.05 and FDR <0.15.
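
The between-group comparison of ∆=D7-D1 described above can be sketched as follows. This is a minimal illustration in plain Python, assuming a two-sided rank-sum test with the normal approximation and no tie or continuity correction; the function names and the concentration values are hypothetical and are not taken from the study data.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p

def deltas(d1, d7):
    """Per-patient change, delta = D7 - D1."""
    return [late - early for early, late in zip(d1, d7)]

# Hypothetical concentrations (µM) of one metabolite, one value per patient.
s_d1, s_d7 = [7.1, 8.0, 6.5, 9.2], [5.0, 6.1, 5.9, 7.0]
ns_d1, ns_d7 = [8.2, 9.0, 7.7, 10.1], [12.5, 14.0, 9.9, 15.3]
z, p = rank_sum_test(deltas(s_d1, s_d7), deltas(ns_d1, ns_d7))
print(f"z = {z:.2f}, p = {p:.3f}")
```

In practice a statistics library would normally be used instead (for example `scipy.stats.ranksums` for the unpaired test and `scipy.stats.wilcoxon` for the paired signed rank test); with only 9 to 11 patients per group, an exact test may be preferable to the normal approximation.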

3.2.3 Multivariate analysis

The aim of the multivariate models was to predict NS patients. Four models were built

according to the different data set used: two models for metabolite concentrations at D1 and D7

respectively and two which combined metabolites and clinical parameters at D1 and D7

respectively. The technique used was the Elastic Net. Data were first normalized (Z-score

normalization) to have unit variance and zero mean. Given the low number of subjects, also in

this case a bootstrapping with replacement was used to obtain a sample size of 20 patients per

group for a total of 40 observations. For every data set analyzed, different models were built with

2, 4, 5 and 10-fold cross validation (CV), and the model with the minimum Mean Squared Error (MSE)

was selected. The outcome (S = 0, NS = 1) was considered as output of the model. The best model

was selected among the different CV models based on the one-standard-error rule. The performance

was evaluated by means of the accuracy, i.e. the proportion of correct classifications (True Negative, TN

and True Positive, TP) among the total number of cases examined.
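
The normalization, the accuracy measure and the one-standard-error rule mentioned above can be illustrated with a short sketch. Several details are assumptions, since the text does not state them: the population standard deviation is used for the Z-score, the continuous model output is converted to a class with a 0.5 threshold, and the one-standard-error rule is read in its usual form (the most regularized model whose cross-validated error lies within one standard error of the minimum). All names and numbers are hypothetical.

```python
import statistics

def z_score(column):
    """Scale one feature to zero mean and unit variance."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # assumption: population SD
    return [(v - mean) / sd for v in column]

def classify(outputs, threshold=0.5):
    """Turn continuous model outputs into classes (S = 0, NS = 1)."""
    return [1 if o >= threshold else 0 for o in outputs]

def accuracy(y_true, y_pred):
    """(TP + TN) / total number of cases."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def one_standard_error_choice(lambdas, cv_mse, cv_se):
    """Largest penalty whose CV error is within one SE of the minimum."""
    best = min(range(len(lambdas)), key=lambda i: cv_mse[i])
    limit = cv_mse[best] + cv_se[best]
    return max(lam for lam, mse in zip(lambdas, cv_mse) if mse <= limit)

# Hypothetical outcome labels and model outputs for four patients.
y_true = [0, 0, 1, 1]
y_pred = classify([0.10, 0.62, 0.80, 0.55])
print(accuracy(y_true, y_pred))
```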

3.3 RESULTS

3.3.1 Clinical characteristics of the study population

Clinical characteristics, scores and comorbidities of the 20 patients at study enrolment (D1)

are reported in Table 3.1. Patients were randomized to receive either 20% albumin and crystalloid

solutions (13 patients) or crystalloid solutions alone (7 patients) for volume replacement. In 11

patients (55%), the source of infection was identified by site culture, including gram-negative (5

patients), gram-positive (2 patients) and both gram-negative and gram-positive bacterial infection

(3 patients), as well as fungal infection (1 patient). In 9 of these patients (82%), antibiotic therapy

empirically administered during the first 24 hours was appropriate. On day 28, mortality rate was

45% (9 patients died). No significant differences were found between the two groups (S and NS) at

enrolment. All the patients were treated according to the standard guidelines internationally

accepted for the treatment of patients with severe sepsis or septic shock.

| Variable | ALL PATIENTS | S | NS |
|---|---|---|---|
| # patients | 20 | 11 (55%) | 9 (45%) |
| Gender (Male) [# (%)] | 13 (65%) | 6 (55%) | 7 (78%) |
| Age (years) | 70.5 (56.0, 77.0) | 70.5 (56.0, 77.0) | 72.0 (69.0, 76.5) |
| BMI - Body Mass Index | 26.8 (25.4, 29.4) | 26.8 (25.4, 29.4) | 27.8 (23.8, 30.3) |
| Heart Rate (bpm) | 101.0 (96.0, 109.0) | 101.0 (96.0, 109.0) | 105.0 (100.0, 110.3) |
| Mean Arterial Pressure (mmHg) | 75.0 (65.0, 85.2) | 75.0 (65.0, 85.2) | 68.3 (65.3, 83.1) |
| Central Venous Pressure (mmHg) | 12.5 (7.5, 15.0) | 12.5 (7.5, 15.0) | 13.0 (9.5, 13.3) |
| Positive End-Expiratory Pressure (cmH2O) | 10.0 (6.5, 10.0) | 10.0 (6.5, 10.0) | 10.0 (7.5, 10.5) |
| FiO2 | 57.5 (50.0, 60.0) | 57.5 (50.0, 60.0) | 60.0 (40.0, 61.3) |
| Central Venous O2 saturation (%) | 78.5 (72.0, 81.5) | 78.5 (72.0, 81.5) | 80.0 (74.5, 83.0) |
| PvCO2 (mmHg) | 50.0 (42.5, 51.5) | 50.0 (42.5, 51.5) | 51.0 (43.3, 52.0) |
| PvO2 (mmHg) | 45.5 (41.0, 49.0) | 45.5 (41.0, 49.0) | 47.0 (43.8, 51.5) |
| PaCO2 (mmHg) | 45.5 (37.5, 49.0) | 45.5 (37.5, 49.0) | 49.0 (37.5, 49.3) |
| PaO2 (mmHg) | 104.0 (85.0, 136.0) | 104.0 (85.0, 136.0) | 92.0 (80.0, 116.5) |
| Lactate (mmol/L) | 3.4 (2.7, 5.4) | 3.4 (2.7, 5.4) | 4.7 (3.2, 6.3) |
| Platelets (10^3/mm^3) | 47.0 (27.0, 81.0) | 47.0 (27.0, 81.0) | 28.0 (20.5, 120.0) |
| Creatinine (mg/dL) | 2.5 (1.4, 3.2) | 2.5 (1.4, 3.2) | 1.9 (1.3, 3.4) |
| Bilirubin (mg/dL) | 1.9 (1.1, 3.0) | 1.9 (1.1, 3.0) | 2.0 (1.4, 6.5) |
| Arterial pH | 7.4 (7.3, 7.4) | 7.4 (7.3, 7.4) | 7.4 (7.3, 7.4) |
| Venous pH | 7.4 (7.3, 7.4) | 7.4 (7.3, 7.4) | 7.4 (7.3, 7.4) |
| Urine Output (mL/day) | 1625 (975, 3060) | 1625 (975, 3060) | 1600 (547, 2122) |
| Use of renal replacement therapy [# (%)] | 3 (15%) | 1 (9%) | 2 (22%) |
| Presence of ventilatory support [# (%)] | 20 (100%) | 11 (100%) | 9 (100%) |
| CLINICAL SCORES | | | |
| SOFA | 12.5 (9.5, 13.5) | 12.5 (9.5, 13.5) | 13.0 (10.0, 14.3) |
| Respiratory System | 3.0 (2.0, 3.0) | 3.0 (2.0, 3.0) | 3.0 (1.8, 3.0) |
| Coagulation | 2.5 (2.0, 3.0) | 2.5 (2.0, 3.0) | 3.0 (1.0, 3.3) |
| Liver | 1.5 (0.0, 2.0) | 1.5 (0.0, 2.0) | 2.0 (0.8, 2.3) |
| Cardiovascular System | 4.0 (3.0, 4.0) | 4.0 (3.0, 4.0) | 4.0 (3.0, 4.0) |
| Renal System | 2.0 (1.0, 3.0) | 2.0 (1.0, 3.0) | 2.0 (0.8, 3.3) |
| COMORBIDITIES | | | |
| Liver Disease [# (%)] | 0 (0%) | 0 (0%) | 0 (0%) |
| Chronic obstructive pulmonary disease [# (%)] | 2 (10%) | 1 (9%) | 1 (11%) |
| Chronic Renal Failure [# (%)] | 0 (0%) | 0 (0%) | 0 (0%) |
| Immunodeficiency [# (%)] | 0 (0%) | 0 (0%) | 0 (0%) |
| Congestive or ischemic heart disease [# (%)] | 2 (10%) | 1 (9%) | 1 (11%) |

Table 3.1 - Characteristics at study enrollment in the two groups of patients (S: survivors; NS: non-survivors). Data are presented as median (25th, 75th percentile) or as frequency (%). The two groups did not differ significantly (p-value >0.05, Wilcoxon rank-sum test for continuous variables and Fisher exact test for categorical variables).

3.3.2 Time-course of plasma metabolites and association with mortality

We first assessed by univariate analysis whether metabolites levels significantly changed

from D1 to D7 within the same group (Wilcoxon signed rank test p<0.05, FDR<0.15). Figure 3.1 gives

a pictorial overview of the changes in metabolite concentrations (rows, mean log2 μM) between D1

and D7 in survivors (left panel) and non-survivors (right panel). Five different species of

lysophosphatidylcholines (lysoPC), 19 of diacyl-phosphatidylcholines (PC aa), 26 of acyl-alkyl

phosphatidylcholines (PC ae), 2 of acylcarnitines (carnitine, C0; butyrylcarnitine, C4), 4 of long-chain

sphingomyelins (SM) increased from D1 to D7 in S patients, while kynurenine decreased. In NS

patients, we observed an overall increase from D1 to D7 of lysoPC, PC, and SM species; amino acids

doubled their plasma concentration together with putrescine.

Profiles of specific metabolites differed significantly between NS and S (Table 3.2). The

majority of PC and lysoPC species showed lower values at D1 and D7 in NS when compared to S,

whereas higher concentrations of acetylcarnitine (C2) and of kynurenine were observed in NS

on D1 and D7, respectively. There were six lipid species comprising saturated long-chain lysoPC and

polyunsaturated very long-chain PC, whose levels decreased at D7 in NS.

| METABOLITE | S | NS | p-value | FDR | NS vs S |
|---|---|---|---|---|---|
| Day 1 (D1) | | | | | |
| lysoPC a C16:1 | 0.657 (0.334, 0.970) | 0.313 (0.291, 0.591) | 0.040 | 0.003 | ↓ |
| PC aa C30:2 | 0.005 (0.005, 0.026) | 0.099 (0.017, 0.147) | 0.046 | 0.005 | ↑ |
| PC aa C38:1 | 0.757 (0.499, 0.936) | 1.025 (0.886, 1.645) | 0.028 | <10^-6 | ↑ |
| PC aa C38:6 | 164.022 (131.032, 174.123) | 92.206 (47.999, 137.945) | 0.033 | <10^-6 | ↓ |
| PC ae C38:0 | 2.633 (2.418, 3.202) | 1.712 (1.124, 2.241) | 0.015 | <10^-6 | ↓ |
| SM C20:2 | 0.095 (0.068, 0.121) | 0.055 (0.043, 0.088) | 0.048 | 0.001 | ↓ |
| C2 | 5.080 (3.369, 8.774) | 11.066 (8.189, 21.852) | 0.048 | 0.028 | ↑ |
| Day 7 (D7) | | | | | |
| lysoPC a C16:0 | 47.046 (24.384, 58.821) | 18.150 (14.455, 33.212) | 0.048 | 0.001 | ↓ |
| lysoPC a C18:0 | 10.807 (6.188, 14.137) | 5.684 (3.502, 7.111) | 0.040 | 0.003 | ↓ |
| lysoPC a C24:0 | 0.096 (0.086, 0.108) | 0.066 (0.062, 0.085) | 0.010 | <10^-6 | ↓ |
| PC aa C32:3 | 3.486 (2.769, 4.240) | 2.019 (1.807, 2.486) | 0.028 | 0.001 | ↓ |
| PC aa C34:4 | 8.604 (6.879, 11.438) | 4.150 (3.464, 5.720) | 0.028 | <10^-6 | ↓ |
| PC aa C36:4 | 615.675 (487.555, 717.315) | 369.822 (340.428, 463.466) | 0.048 | <10^-6 | ↓ |
| PC ae C34:3 | 26.979 (20.127, 31.112) | 18.013 (15.057, 21.835) | 0.048 | 0.002 | ↓ |
| PC ae C40:1 | 1.633 (1.209, 1.706) | 0.844 (0.770, 1.230) | 0.010 | 0.012 | ↓ |
| PC ae C42:4 | 0.788 (0.679, 0.892) | 0.608 (0.487, 0.663) | 0.040 | <10^-6 | ↓ |
| Kynurenine | 7.680 (4.965, 8.735) | 12.000 (8.745, 23.800) | 0.012 | <10^-6 | ↑ |

Table 3.2 - Metabolite levels comparison between survivors (S) and non-survivors (NS) at day 1 and at day 7. Only significant results are reported (p < 0.05, FDR < 0.15). Plasma concentrations are expressed in μM and shown as median 25th and 75th percentile. The arrows indicate that the metabolite concentration in NS group is lower (↓) or higher (↑) with respect to S group.

Figure 3.1 - Heat maps of the metabolites (mean Log2 μM) whose concentrations changed significantly from D1 to D7 in S (right panel) and NS (left panel) (Wilcoxon signed rank test p<0.05, FDR<0.15)

The variations of metabolites from D1 to D7, expressed as ∆=D7-D1, were then compared

between S and NS. Significant differences in metabolite levels were found (Figure 3.2) (Wilcoxon

test p<0.05, FDR<0.15). As for S patients, a clear negative variation was observed for kynurenine,

whereas lysoPC and PC (mainly low saturated long-chain species) showed a positive variation.

Figure 3.2 - Comparison of the absolute differences in metabolite concentrations (μM) from day 1 to day 7 (Δ=D7–D1) in survivors (S) and non-survivors (NS), shown as box-plots. Outliers, defined using 1.5 times the interquartile range, are marked by a cross. Each plot represents a different metabolite (Wilcoxon test p < 0.05, FDR < 0.15).

3.3.3 Association between metabolic patterns and mortality

As a combination of features can give more information than features considered individually,

we used prediction models with the aim of identifying a set of features that are most strongly associated

with the target class, i.e. the NS group. The model coefficients can be interpreted as follows: the higher

their absolute value, the higher their weight in the model; a positive coefficient denotes a positive

correlation with the event (i.e. 28-day mortality), a negative coefficient vice versa. On D1, PC aa

C38:1 and C4 (butyrylcarnitine) had a strong positive correlation with 28-day mortality,

whereas PC aa C40:6 and PC ae C38:0 were inversely associated (Figure 3.3 panel A). On D7,

kynurenine and PC aa C42:4 were positively correlated with 28-day mortality, while PC aa C40:1 and

PC ae C40:1 were negatively correlated (Figure 3.3 panel B). Model accuracy was 0.84 ± 0.25 for the

model built on D1 data and 0.73 ± 0.35 for the one at D7.
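
The reading of the coefficients given above (weight by absolute value, direction by sign) can be made concrete with a small sketch. The feature names and coefficient values below are hypothetical placeholders and do not reproduce the values shown in Figure 3.3.

```python
def rank_coefficients(coefficients):
    """Drop features shrunk to zero and sort the rest by absolute weight."""
    nonzero = [(name, b) for name, b in coefficients.items() if b != 0.0]
    return sorted(nonzero, key=lambda item: abs(item[1]), reverse=True)

def direction(beta):
    """Sign convention of the text: the event is 28-day mortality (NS = 1)."""
    return "positively" if beta > 0 else "negatively"

# Hypothetical elastic net coefficients on Z-scored features.
example = {"feature_a": 0.8, "feature_b": -1.2, "feature_c": 0.0, "feature_d": 0.3}

for name, beta in rank_coefficients(example):
    print(f"{name}: {beta:+.2f} ({direction(beta)} associated with mortality)")
```

Because the features are Z-scored before fitting, the absolute values of the coefficients are comparable across features, which is what makes this ranking meaningful.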

Figure 3.3 - Elastic Net coefficients of the models built on metabolite concentrations at D1 (A) and D7 (B). Model accuracy was 0.84 ± 0.25 and 0.73 ± 0.35, respectively.

3.3.4 Integrated clinical and metabolomics determinants of mortality

We next checked for a possible redundancy of prognostic information among circulating

metabolites and clinical variables. Figure 3.4 shows the best elastic net regression models that

considered both metabolites and clinical variables measured at D7, whereas no reliable predictive

models were obtained with metabolites and clinical parameters collected at D1. As shown in Figure

3.4, daily urinary output, plasma concentration of lysoPC a C24:0, and mean arterial pressure were

negatively associated with the outcome, while the risk of death increased with the cardiovascular

subcomponent of the SOFA score which represents the need for vasoactive drugs. The accuracy of

the model was 0.86±0.03.

Figure 3.4 - Elastic Net coefficients built on metabolite concentrations and clinical parameters at D7. Model accuracy was 0.86 ± 0.03.

3.4 DISCUSSION

This study is a preliminary investigation aimed to characterize the metabolomics profiles of

patients with severe septic shock, and to integrate them with the clinical manifestation

characterizing this syndrome. We identified several metabolomics alterations, previously reported

in patients with severe sepsis and septic shock[89], [96]–[102], supporting the feasibility and the

rationale underlying our pilot study design. Overall, profiles of specific metabolites measured on

day 1 and day 7 differed markedly between S and NS patients, and some metabolic features

appeared to be associated with mortality. Though we cannot discuss the significance of every single

metabolite, some general comments on the main metabolic pathways and their pathological

relevance to septic shock are warranted.

NS patients were characterized by a significant elevation of the polyamine pool (e.g.

spermidine), from D1 to D7. Since polyamines mediate the complex interplay between bacterial

infection and the host immune response[103],[104], this might suggest an altered regulation of

pathogen-host interactions in these patients. Moreover, non-survivors had increased plasma level

of glucogenic amino acids, in line with relative hepatic dysfunction occurring early in sepsis and

consequent derangement in the hepatic gluconeogenesis[103],[104]. A peculiarity of NS patients

was the significant increase from D1 to D7 of plasma kynurenine (Figure 3.2): kynurenine level at D7

was almost doubled in 28-day NS compared to S (Table 3.2). A clear relation has already been made

between accelerated tryptophan catabolism along the kynurenine pathway and inflammatory

reactions[107], [108]. Furthermore, it has recently been shown that kynurenine plasma level might

predict the development of sepsis in major trauma patients[109], and its modulation has already

been associated to 28-day mortality in critically ill patients[89]. Increased production of kynurenine

has been proposed to contribute to hypotension in sepsis[110] and it has been associated with

dysregulated immune response and impaired microvascular reactivity[111]. We can thus speculate

that the lower kynurenine concentration found in S patients may represent a favorable host

response trait. However, whether kynurenine metabolism is a pathogenic factor in sepsis or rather

an epiphenomenon needs further evaluation.

Decreased plasma level of PC and lysoPC species was a prominent component of the

metabolic phenotype in NS (Table 3.2 and Figure 3.2), in accordance with the overall lipidome

alterations observed in sepsis and critically ill patients[96],[101],[102]. Already 24 hours after

admission (i.e. D1), NS patients showed a marked decrease in PC species containing long-chain

polyunsaturated fatty acids (LCPUFAs), which persisted at D7 with further elongation/desaturation

products. Since LCPUFAs reduce T-cell activation and dampen inflammation[112], it might be

speculated that a decrease in PC containing LCPUFAs can hamper their protective effects, including

a concerted action of either withdrawing pro-inflammatory eicosanoids or incrementing anti-

inflammatory eicosanoids. We can reasonably exclude dietary-derived influence on LCPUFAs, since

the differences in their concentrations were present already at D1, and patients were all subjected

to dietary support according to standard guidelines on the treatment of patients with severe sepsis

or septic shock[59].

A general explanation for these findings is that the lowered circulating level of PCs found in

NS patients might be due to reduced or unbalanced availability of fatty acid substrates for their

biosynthesis, consistent with a deregulated mitochondrial and/or peroxisomal beta-oxidation

occurring early in sepsis, as already anticipated in Chapter 1. Regulation of fatty acid synthesis and

oxidation in the mitochondria is schematized in Figure 3.5. Briefly, malonyl-CoA, produced during

fatty acid synthesis, inhibits the uptake of fatty acylcarnitine (and thus fatty acid oxidation) by

mitochondria. When fatty acyl-CoA levels rise, fatty acid synthesis is inhibited and fatty acid

oxidation increases. Thus, the decreased plasma acetylcarnitine (C2, Table 3.2) observed in S

patients compared to NS ones would indicate a generally more efficient use of substrates for energy

production and a probably reversible mitochondrial damage in survivors.

Figure 3.5 - Schema of the regulation of fatty acid synthesis and oxidation in the mitochondria.

A further bio-signature characterizing NS patients was their reduction over time in circulating

monounsaturated and saturated lysoPCs. Such changes in lysoPC concentration are in accordance with

Park et al.[100] who showed a similar downward trend of lysoPCs in 28-day non-survivors as

compared to survivors in sepsis. Decreased lysoPC levels have also been reported in septic

patients compared to healthy controls[99]. The well-known pro-inflammatory activities of

lysoPCs[111],[112] seem to contradict the poor outcome (and the lower lysoPC levels) observed in

sepsis. The reduction in circulating lysoPC may simply reflect their enhanced conversion to

lysophosphatidic acid, which is known to induce a multitude of cellular responses through its action

on immunologically relevant cells[115]. Therefore, it is conceivable that lysoPC reduction may

promote an excessive immune response with detrimental effect in those patients who will not

survive[99],[100].

In the multivariate models, low unsaturated long-chain PC species were associated with

mortality together with circulating kynurenine (Figure 3.3). Moreover, lysoPC a C24:0 was negatively

correlated to the event in combination with clinical variables (Figure 3.4). The recurrent decrease in

lysoPC a C24:0 may denote an alteration in very long chain fatty acids, such as lignoceric acid as

preferred substrates for the peroxisomal beta-oxidation. Consequently, a down-regulation of the

peroxisomal lignoceryl-CoA ligase activity might be hypothesized[116].

Overall, our findings suggest a multifactorial origin for such abnormal phospholipid metabolism,

in which dysregulation of phospholipases, catabolism of lysoPC, peroxisomal dysfunction, imbalance

in the levels of saturated/unsaturated fatty acids could all be involved. However, the underlying

molecular mechanisms potentially regulating the circulating PC and lysoPC species in S patients as

compared to NS remain unclear and need further investigations.

We acknowledge that this study has several limitations. First, a targeted approach restricts,

by its nature, the panel of candidate markers and focuses only on few metabolic pathways. Second,

the sample size is limited, and confirmatory studies are necessary. This is mainly due to the fact that

many patients with such severity of septic shock do not reach day 7 of their ICU stay, so we ended up

with a limited number of patients fulfilling our inclusion criteria. Third, we measured metabolites at

only two time points within one week from the diagnosis of septic shock; metabolites with temporal

changes outside this time window might thus provide a more precise insight into the clinical

progression of the disease. Nevertheless, we identified a combination of circulating metabolites

altered during the early course of severe septic shock and associated with mortality.

This preliminary investigation was therefore very informative in capturing possible evolution and

variations of metabolic signatures during a full blown, durable and well-established

pathophysiologic manifestation of severe septic shock. Focusing on a homogeneous group of

patients rather than on a larger number of scattered phenotypes allowed for a better control of

potentially confounding factors. Therefore, the metabolic changes observed in our samples pertain

more closely to the selected pathophysiological condition and should be confirmed in a larger cohort

by including different phenotypes and not only the severe patients.

3.5 REMARKS

In conclusion, the data here presented confirm the feasibility of our approach in determining

changes in circulating metabolites able to characterize the progression of septic shock. Our

results are in line with recent findings indicating that lipid homeostasis and tryptophan catabolism

might influence mortality in septic shock. The association of early changes in the plasma levels of

both lipid species and kynurenine with mortality, with possible implications for early intervention, is

the most important result of our study. Although our analyses cannot determine causality, they

suggest that alterations in kynurenine and lipid species might represent not only risk factors for

patients with severe septic shock, but important pathophysiologic mechanisms deserving further

investigations.

4 INTEGRATION OF METABOLOMICS AND PROTEOMICS: AN ANCILLARY STUDY ON ALBIOS DATABASE

Integration of metabolomics and proteomics information is a promising approach for

revealing molecular pathways as well as for identifying and quantifying differentially expressed

molecules, independently from multiple trigger factors leading to septic shock. To this purpose, we

examined plasma metabolome, proteome and clinical features in a subset of patients with severe

septic shock (SOFA score >8), enrolled in the multicenter ALBIOS study[63], already described in

Chapter 3. Proteomics analyses were performed at the Proteomics Platform, Parc Cientific de

Barcelona, Spain, under the supervision of Dr. Eliandre de Oliveira.

Overall, we aimed to integrate the results obtained by the metabolomic analyses previously

reported with the information coming from proteomics in order to have a wider picture of the

interactions occurring between metabolites and proteins and to gain deeper insights into septic

shock progression and individual patient’s response. We merged the results obtained by

spectrometry-based quantitative metabolomics with protein signals measured by iTRAQ. We then

applied the Elastic Net technique, LDA and PLS-DA to build integrated classification models used to

find features associated with mortality. Our results confirm that plasma levels of

lipid species are altered early in non-survivors. As for proteins, the most important differences between

the two groups are related to proteins which are part of the inflammatory response and of the

coagulation cascade, which are two of the most important pathways involved in septic shock

progression.

In the following, we present the rationale behind our analyses, the study design and the

methods applied. Afterward, the obtained results will be compared and discussed.

4.1 INTRODUCTION

In the last decade, advances in high-throughput approaches have allowed the development

of proteomic and metabolomic studies for evaluating the association of genetic and phenotypic

variability with disease progression. These considerations are of fundamental importance in case of

complex multifactorial syndromes such as septic shock. In fact, response to treatment differs from

patient to patient and is extremely difficult to predict. Thus, integration of both proteomics and

metabolomics approaches may better describe the pathophysiological mechanisms involved and

allow a more complete characterization of this condition.

In the study previously outlined (see Chapter 3)[117], we found that profiles of specific

metabolites measured one day (D1) and one week (D7) after diagnosis of septic shock differed

markedly between survivor (S) and non-survivor (NS) patients. More precisely, we observed that

low unsaturated long-chain phosphatidylcholines (PCs) species and lysophosphatidylcholines

(lysoPCs) species were associated with survival together with circulating kynurenine. We thus

speculate that lipid homeostasis and tryptophan catabolism might influence mortality in septic

shock.

In light of these considerations, the objective of these analyses is to better characterize

non-survivor patients according to the variations that occurred in metabolite concentrations from D1

to D7, expressed as ratio D7/D1. This information will then be integrated with proteomics and

clinical data in order to acquire a more complete view of the pathways involved in this complex

syndrome.

This pilot retrospective investigation was an ancillary study of the multicenter ALBIOS clinical

trial (NCT00707122)[63]. Inclusion criteria for the present study are the same already adopted in

our previous analyses (see Chapter 3, paragraph 3.2.1)[117]. Three out of the 20 patients included

in our previous study have been excluded due to hemolysis in their blood samples and thus they

were not suitable for the proteomics analyses.

Patients were analyzed according to their survival at 28 days after study enrollment. For each

patient, plasma samples were available at day 1 (acute state, D1) and at day 7 (steady state; D7)

after diagnosis of septic shock. For each time point (D1 and D7), we considered 24 clinical

parameters (Table 1), 137 metabolite concentrations (µM) and 132 protein values, expressed as

peak intensities, for a total of 293 features.

4.2.2 Proteomics data analyses

A multi-iTRAQ experiment was designed to compare the plasma protein pattern expression

between S and NS patients. Details on the protocol are illustrated in Appendix B. Briefly, samples

from 17 septic shock patients and from 5 healthy donors (M1 to M5) were arranged in six iTRAQ™

8plex experiments. The 5 healthy donors were used for LC-MS normalization purposes. For each of

the six iTRAQ runs, more than 200 proteins were quantified.

4.2.2.1 Criteria for protein selection

The following procedure was followed for protein selection: 1) identification of proteins,

which have been detected in all six runs; 2) removal of proteins which are contaminants, i.e. the

most abundant proteins that should already have been depleted before iTRAQ analyses (e.g. serum

albumin); 3) exclusion of proteins identified only by one peptide, even if unique for that protein.

After this selection, a total of 132 proteins were identified and considered for further analyses.

4.2.2.2 Quality control

To assess if measures from the six different runs are comparable, blood samples from five

healthy controls (M1 to M5) were included in the analyses and their replicates in different runs were

used to test for significant differences or bias. For each of the five control samples, the differences

in measured protein abundance between the pairs of replicates were computed (e.g. for sample

labeled as M1 we computed Δ RUN 1-5 = M1 RUN1 – M1 RUN5).

For each of the series of differences, the Lilliefors test was performed against the null

hypothesis of Gaussian distribution and Student's t-test against the null hypothesis that the series

have mean value equal to zero. In this way, we statistically verify if the differences between runs

are randomly distributed around zero. For each of the series, both tests yielded p-values > 0.05, thus

we cannot reject the hypothesis that differences are normally distributed around zero. This suggests

that there is no systematic bias between runs.
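
The quality-control step above can be sketched as follows: compute the per-protein differences between two replicates of the same control sample, then the one-sample t statistic against a zero mean. This is a minimal sketch in plain Python with hypothetical abundance values; it stops at the test statistic, because a p-value requires the t distribution (available, for instance, through `scipy.stats.ttest_1samp`), and the Lilliefors normality test is not reproduced here (one implementation is `statsmodels.stats.diagnostic.lilliefors`).

```python
import math
import statistics

def replicate_differences(run_a, run_b):
    """Per-protein difference between two replicates of one control sample."""
    return [a - b for a, b in zip(run_a, run_b)]

def one_sample_t(differences):
    """t statistic for the null hypothesis that the mean difference is zero."""
    n = len(differences)
    mean = statistics.fmean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Hypothetical abundances of five proteins for control M1 in runs 1 and 5.
m1_run1 = [1.02, 0.95, 1.10, 0.88, 1.00]
m1_run5 = [1.00, 0.97, 1.05, 0.91, 1.01]

delta_run_1_5 = replicate_differences(m1_run1, m1_run5)
print(f"t = {one_sample_t(delta_run_1_5):.3f} with {len(delta_run_1_5) - 1} degrees of freedom")
```

A t statistic close to zero is compatible with differences that scatter randomly around zero, which is the condition the text uses to conclude that the runs are comparable.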

4.2.3 Statistical analyses

For each of the 132 protein abundances, comparisons between S (9 patients) and NS (8

patients) at D1 and D7 were performed using a 2-way ANOVA. The observations were grouped

according to two factors: runs and outcome (S/NS). Two separate tests were performed, one for D1

and one for D7. For each protein, a total of 3 p-values were calculated, defined as p-valueOUTCOME,

p-valueRUN, and p-valueOUTCOME*RUN. Only those proteins associated with p-valueOUTCOME<0.05 and

p-valueOUTCOME*RUN>0.05 were considered.
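
The selection rule above amounts to a simple filter on the ANOVA p-values: keep a protein when the outcome effect is significant and the outcome-by-run interaction is not. A minimal sketch, with hypothetical protein names and p-values (the ANOVA itself is not computed here):

```python
def passes_outcome_filter(p_outcome, p_interaction, alpha=0.05):
    """Significant outcome effect without a significant outcome*run interaction."""
    return p_outcome < alpha and p_interaction > alpha

# Hypothetical p-values per protein: (outcome, run, outcome*run).
anova_p = {
    "protein_A": (0.01, 0.30, 0.40),  # outcome effect, no interaction: kept
    "protein_B": (0.01, 0.02, 0.03),  # outcome effect depends on the run: dropped
    "protein_C": (0.20, 0.50, 0.60),  # no outcome effect: dropped
}

retained = [
    name
    for name, (p_out, _p_run, p_int) in anova_p.items()
    if passes_outcome_filter(p_out, p_int)
]
print(retained)
```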

On these proteins, we compared the peak intensities measured at D1 and D7 of S and NS

groups by means of Wilcoxon rank-sum test. To overcome the problem of the large number of

statistical comparisons, we also computed the false discovery rate (FDR). The FDR was assessed after

the bootstrapping procedure: the sample size was increased from 9 to 20 subjects for the S group

and from 8 to 20 subjects for NS by means of a random sampling with replacement, for a total of 40

observations. The bootstrapping procedure was used only for the FDR assessment in order to increase

the number of samples for the estimation of the p-value distribution. Results were considered statistically

significant when p-value <0.05 (no bootstrapping) and FDR <0.15.
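
The oversampling step described above (random sampling with replacement up to 20 subjects per group) can be sketched as below. The patient identifiers are hypothetical, the fixed seed is only there to make the sketch reproducible, and the way the FDR is then derived from the resampled p-values is not shown because the text does not detail it.

```python
import random

def oversample_with_replacement(group, size=20, rng=None):
    """Draw `size` subjects from `group`, allowing repeats."""
    if rng is None:
        rng = random.Random()
    return [rng.choice(group) for _ in range(size)]

rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
survivors = [f"S{i}" for i in range(1, 10)]      # 9 hypothetical patient IDs
non_survivors = [f"NS{i}" for i in range(1, 9)]  # 8 hypothetical patient IDs

resampled_s = oversample_with_replacement(survivors, 20, rng)
resampled_ns = oversample_with_replacement(non_survivors, 20, rng)
print(len(resampled_s) + len(resampled_ns))  # total number of observations
```

Since every resampled observation is a copy of an original patient, this procedure enlarges the sample used to estimate the p-value distribution but adds no independent information, which is consistent with the choice of reporting p-values without bootstrapping.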

Comparisons between D1 and D7 within the same group were performed with a 2-way

ANOVA for repeated measures. The repeated measures model included outcome, run, and day (D1

or D7). A total of 4 p-values were computed: p-valueDAY, p-valueOUTCOME*DAY, p-valueRUN*DAY, p-

valueRUN*DAY*OUTCOME, where DAY represents the repeated factor. Those proteins (3 in total) which

were affected by the run (i.e. p-valueRUN*DAY*OUTCOME<0.05 and p-valueRUN*DAY<0.05) were excluded

from further analyses as we cannot exclude the run effect. Post hoc comparisons were then

performed on the remaining proteins and their trend from D1 to D7 was compared by means of the

paired Wilcoxon signed-rank test.

Finally, we compared the ratio D7/D1 for both metabolite concentrations and protein peak

intensities between S and NS by Wilcoxon rank-sum test. Also in this case the FDR was computed as

previously described and results were considered statistically significant when p-value <0.05 (no

bootstrapping) and FDR <0.15.

4.2.4.1 Data from targeted metabolomics analyses

Our aim was to characterize NS patients, in particular to find the species which are mostly

associated to the outcome.

We built the classification models on the ratio D7/D1 of metabolite concentrations. Because

of the small sample size (17 patients) and the large number of features (137 metabolites),

collinearity represents a crucial issue. The method used to reduce the feature dimension is the

minimal-redundancy-maximal-relevance (mRMR), as previously described (see Chapter 2). We

discretized the feature distributions according to the interquartile range before applying the mRMR

algorithm.
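
The feature-ranking step above can be illustrated with a compact sketch of a greedy mRMR selection on discretized features. Two points are assumptions rather than details from the text: the discretization maps values below the first quartile, within the interquartile range, and above the third quartile to three levels, and the mRMR score is the difference between relevance and mean redundancy (other variants use their quotient). All names and values are hypothetical.

```python
import math
import statistics
from collections import Counter

def discretize_by_iqr(values):
    """Map each value to -1 (below Q1), 0 (within the IQR) or 1 (above Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [-1 if v < q1 else (1 if v > q3 else 0) for v in values]

def mutual_information(a, b):
    """Mutual information (bits) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )

def mrmr(features, labels, k):
    """Greedy selection: maximize relevance to labels minus mean redundancy."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):
        def score(f):
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy
        candidates = [f for f in features if f not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Hypothetical D7/D1 ratios of three metabolites in eight patients.
ratios = {
    "met_1": [0.5, 0.6, 0.7, 0.8, 1.4, 1.6, 1.9, 2.2],
    "met_2": [0.6, 0.7, 0.8, 0.9, 1.5, 1.7, 2.0, 2.3],
    "met_3": [1.0, 2.0, 0.9, 1.8, 1.1, 2.1, 0.8, 1.9],
}
outcome = [0, 0, 0, 0, 1, 1, 1, 1]  # S = 0, NS = 1
discrete = {name: discretize_by_iqr(vals) for name, vals in ratios.items()}
print(mrmr(discrete, outcome, k=2))
```

The redundancy term is what distinguishes mRMR from a plain relevance ranking: a feature that duplicates one already selected is penalized even if it is strongly related to the outcome.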

We considered the first 10, 20 and 30 ranked metabolites to build three different

classification models. Data were first normalized (Z score normalization) and the dataset was divided

into a training and a test set comprising two thirds and one third of the observations, respectively.
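
A minimal sketch of the two-thirds/one-third partition follows. It assumes a simple random split of patient indices; whether the original split was stratified by outcome is not stated, and the seed is arbitrary.

```python
import random

def split_two_thirds(n_observations, rng):
    """Randomly assign about two thirds of the indices to training, the rest to test."""
    indices = list(range(n_observations))
    rng.shuffle(indices)
    n_train = round(n_observations * 2 / 3)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_two_thirds(17, random.Random(0))
print(len(train_idx), len(test_idx))
```

With 17 patients this yields 11 training and 6 test observations, so each misclassified test patient changes the test accuracy by roughly 17 percentage points.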

We adopted two strategies to further select a smaller subset of features. In the first, we fitted an elastic net logistic model (logit link) to the training set 50 times. We considered a binary classification (S = 0, NS = 1), so the output of the model is a value between 0 and 1, which can be read as a probability of belonging to the NS class. We then selected the coefficients of the model with the minimal deviance. In the second strategy, we took the shrinkage parameter λ corresponding to the model with the minimal deviance and used it to fit another elastic net model and obtain the coefficients of the logistic regression. In both cases, the models were then evaluated on the test set and the performance was assessed by the number of correctly classified observations.
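To make the role of λ and of the deviance concrete, the following sketch fits a logistic model with an elastic net penalty by proximal gradient descent and computes its binomial deviance. It is not the software used for the thesis. The penalty parametrization (mixing parameter `alpha`), the optimizer, the step size and the number of iterations are all assumptions chosen for illustration.

```python
from math import exp, log


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_elastic_net_logistic(X, y, lam, alpha=0.5, step=0.1, n_iter=2000):
    """Proximal-gradient fit of a logistic model with an elastic net penalty.

    Minimizes mean log-loss + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2);
    the intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    shrink = step * lam * alpha
    for _ in range(n_iter):
        resid = [sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row))) - yi
                 for row, yi in zip(X, y)]
        b0 -= step * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n
            grad += lam * (1 - alpha) * b[j]
            z = b[j] - step * grad
            # Soft-thresholding is the proximal step for the L1 part.
            if z > shrink:
                b[j] = z - shrink
            elif z < -shrink:
                b[j] = z + shrink
            else:
                b[j] = 0.0
    return b0, b


def deviance(X, y, b0, b):
    """Binomial deviance: -2 * log-likelihood of the fitted probabilities."""
    total = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row)))
        p = min(max(p, 1e-12), 1 - 1e-12)
        total += yi * log(p) + (1 - yi) * log(1 - p)
    return -2.0 * total
```

A large λ shrinks every coefficient to exactly zero, while a small λ keeps the informative features. Choosing λ by minimal deviance would mean scanning a grid of values and keeping the one with the lowest (presumably cross-validated) deviance.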

LDA and PLS-DA were also implemented. More precisely, LDA was performed on the first 10 ranked metabolites, and the coefficients of the linear boundary between the two classes were retrieved. PLS-DA was performed on both the first 10 and the first 20 ranked metabolites, considering 3 PLS components. Since the groups are unbalanced, the data matrix was weighted-centered in order to avoid a decision boundary shifted towards the larger group. The performance of the classification models was evaluated by the number of correctly classified observations.
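One common way to implement weighted centering for unbalanced two-class PLS-DA is to subtract the average of the class means instead of the overall mean, so that each class contributes equally to the center. The sketch below assumes this definition, which the text does not spell out; the function name is hypothetical.

```python
def weighted_center(rows, labels):
    """Center the data on the unweighted average of the class means.

    Assumption: 'weighted centering' means every class has equal weight,
    whatever its size. Returns the centered rows and the center.
    """
    classes = sorted(set(labels))
    class_means = []
    for c in classes:
        members = [r for r, lab in zip(rows, labels) if lab == c]
        class_means.append([sum(col) / len(members) for col in zip(*members)])
    n_feat = len(rows[0])
    center = [sum(m[j] for m in class_means) / len(classes)
              for j in range(n_feat)]
    centered = [[v - cj for v, cj in zip(r, center)] for r in rows]
    return centered, center
```

With three observations at 0 in one class and one observation at 4 in the other, the ordinary mean is 1 while the weighted center is 2, halfway between the two class means.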

4.2.4.2 Integration of targeted metabolomics and proteomics data

We built an integrated model by merging targeted metabolomics and proteomics data. Also

for proteomics data we computed the ratio D7/D1 for each of the 132 protein peak intensities. To

avoid multicollinearity, the mRMR algorithm was applied and the first 50 ranked proteins were

selected. These proteins were then combined with the first 50 ranked metabolites and the mRMR

was performed again on this new features subset composed of 50 metabolites and 50 proteins.

After Z score normalization, we considered the first 10, 20 and 30 ranked features to build the

classification models using the two strategies described in the previous paragraph. LDA and PLS-DA

were also performed as stated above.

4.2.4.3 Integration of metabolomics, proteomics and clinical data

Finally, we built a comprehensive model which combines targeted metabolomics,

proteomics and clinical data. Only continuous clinical variables were considered for which the ratio

D7/D1 was computed. Total SOFA score and partial SOFA scores were not included to avoid any

redundancy. In fact, they are calculated from clinical parameters which are included in the model

and they are thus likely to be correlated. Finally, a total of 17 clinical variables were included.

The 17 clinical variables were added to the first 20 ranked features from the set of

metabolites and proteins, obtained as previously described. The mRMR was then performed on this

subset of features to further reduce the number of features. After Z score normalization, the first

10, 20 and 30 ranked features were selected to build the classification models. LDA and PLS-DA were

also performed on these three sets of features.

4.3 RESULTS

Patients with severe septic shock enrolled in the multicenter ALBIOS clinical trial[63], and

fulfilling the inclusion/exclusion criteria as previously reported, were analyzed. The baseline

characteristics of these 17 patients, together with the source and kind of infection, are reported in Table 4.1. Note

that they are quite similar to those reported previously (see Table 3.1). In 9 patients, the source of infection was

identified at site culture, including gram-negative (4 patients), gram-positive (2 patients) and both

gram-negative and gram-positive bacterial infection (gram mix, 2 patients), as well as other kinds of

microorganisms (mixed, 1 patient). Two patients (one S and one NS), had multiple infections (S

abdomen and other, NS lungs and other). 9 out of 17 patients (82%) received antibiotic therapy

empirically decided during the first 24 hours. Patients were randomized to receive either 20%

albumin and crystalloid solutions (10 patients) or crystalloid solutions alone (7 patients) for volume

replacement. On day 28, mortality rate was 47% (8 patients died).

                        ALL PATIENTS    S              NS
Age (years)             66.1 ± 13.9     63.8 ± 16.6    67.9 ± 12.5
BMI (kg/m2)             27 ± 3.9        27.5 ± 3.9     27.9 ± 3.2
SOURCE OF INFECTION
Lungs [# (%)]           6 (35%)         1 (11%)        5 (63%)
Abdomen [# (%)]         8 (47%)         4 (44%)        2 (25%)
Genitourinary [# (%)]   5 (29%)         5 (56%)        0 (0%)
Other [# (%)]           3 (18%)         1 (11%)        2 (25%)
KIND OF INFECTION
Negative [# (%)]        8 (47%)         3 (33%)        5 (63%)
Mixed [# (%)]           1 (6%)          0 (0%)         1 (13%)
Gram positive [# (%)]   2 (12%)         1 (11%)        1 (13%)
Gram negative [# (%)]   4 (24%)         3 (33%)        1 (13%)
Gram mix [# (%)]        2 (12%)         2 (22%)        0 (0%)

Table 4.1 – Clinical characteristics of survivors (S) and non-survivors (NS) at enrollment. For 8 patients (3 S and 5 NS) the bacterial culture was negative (Negative). No statistically significant differences were found between the two groups.

Clinical and laboratory variables on day 1 (D1) and day 7 (D7) are reported in Table 4.2. All the

patients were treated according to the standard guidelines internationally accepted for the

treatment of patients with severe sepsis or septic shock. No significant differences were found

between the two groups.

                                D1                                  D7
                                S                 NS                S                   NS
Heart Rate (bpm)                103.5 ± 28.4      106.1 ± 12.8      80.4 ± 11.0         91.1 ± 8.7
Mean Arterial Pressure (mmHg)   76.6 ± 18.3       72.0 ± 11.0       96.3 ± 13.8 *       78.4 ± 12.0 *
Central Venous Pressure (mmHg)  11.4 ± 5.8        11.5 ± 4.4        7.9 ± 5.1           8.8 ± 2.1
Urine output (mL)               2556.1 ± 918.4    1840.0 ± 1652.8   3705.6 ± 1580.2 *   1737.5 ± 1478.9 *
FiO2 (%)                        59.7 ± 12.4       56.3 ± 20.8       40.6 ± 8.5          46.3 ± 23.4
ScvO2 (%)                       73.3 ± 11.6       78.5 ± 7.5        77.1 ± 7.3          77.9 ± 4.5
PvCO2 (mmHg)                    46.8 ± 5.7        47.8 ± 4.9        50.6 ± 5.2          46.5 ± 7.8
PaCO2 (mmHg)                    42.3 ± 6.3        44.3 ± 6.4        45.3 ± 3.9          41.6 ± 8.3
PvO2 (mmHg)                     43.3 ± 4.6        46.5 ± 7.4        44.1 ± 5.9          44.6 ± 5.9
PaO2 (mmHg)                     122.2 ± 61.0      98.5 ± 32.0       126.9 ± 29.8        115.1 ± 61.0
LAT (mmol/L)                    3.0 ± 1.6         5.0 ± 2.3         1.4 ± 0.5           2.4 ± 2.2
Platelets (x10^3/mm^3)          63.9 ± 35.4       61.4 ± 68.1       112.0 ± 67.2        80.3 ± 50.9
Serum creatinine (mg/dL)        2.7 ± 0.9         2.1 ± 1.3         1.8 ± 1.7           1.8 ± 1.5
Serum bilirubin (mg/dL)         1.7 ± 0.9         5.0 ± 4.8         1.9 ± 1.3           9.1 ± 10.8
Presepsin (µg/L)                1486 ± 1256       2673 ± 2351       830 ± 458           4969 ± 5826
Arterial pH                     7.4 ± 0.1         7.4 ± 0.1         7.5 ± 0.0           7.4 ± 0.1
Venous pH                       7.4 ± 0.0         7.4 ± 0.1         7.4 ± 0.0           7.4 ± 0.1
CVV [# (%)]                     0 (0%)            2 (25%)           1 (11%)             3 (38%)
Ventilatory Support [# (%)]     9 (100%)          8 (100%)          4 (44%)             7 (88%)
CLINICAL SCORES
SOFA                            11.3 ± 2.4        12.4 ± 3.2        5.0 ± 2.1           9.3 ± 5.1
Respiratory System              2.4 ± 1.0         2.4 ± 1.3         1.2 ± 0.8           1.9 ± 1.0
Coagulation                     2.3 ± 0.9         2.5 ± 1.4         1.6 ± 1.1           1.9 ± 1.2
Liver                           1.1 ± 0.9         1.9 ± 1.2         1.0 ± 1.0           2 ± 1.8
Cardiovascular System           3.6 ± 0.5         3.6 ± 0.5         0.0 ± 0.0           1.1 ± 1.5
Renal System                    1.9 ± 0.6         2.0 ± 1.7         1.2 ± 1.5           2.4 ± 1.8

Table 4.2 – Clinical and laboratory variables at D1 and D7 for the 17 patients, divided in survivors (S, 9 pts) and non-survivors (NS, 8 pts). Data are presented as mean ± SD or as frequency. Mean Arterial Pressure and Urine output (marked with *) at D7 were significantly different between the two groups (p-value <0.05 Wilcoxon rank-sum test).

4.3.2 Changes in protein expressions between groups

between S and NS patients. The criteria for protein selection and quality control are described in detail in Appendix B. In total, 132 proteins were selected after quality control. For the significant proteins, the extended name and main functions are reported in Table B.1 (Appendix B).

We first assessed by univariate analysis if protein levels are significantly different between

the S and NS separately at the two time points (Wilcoxon rank-sum test p-value < 0.05, FDR < 0.15).

Proteins P02745, Q86VB7, Q96PD5 and Q9Y5Y7 were significantly different between S and NS at D1

(Figure 4.1) and proteins P05543, P13796 and P36222 at D7 (Figure 4.2).

Figure 4.1 - Boxplots of protein peak intensities significantly different between S (blue) and NS (orange) at D1 (Wilcoxon rank-sum test p < 0.05, FDR < 0.15). Distribution of differences is shown as box-plot, each plot represents a different protein: P02745, Complement C1q subcomponent subunit A; Q86VB7, Scavenger receptor cysteine-rich type 1 protein M130; Q96PD5, N-acetylmuramoyl-L-alanine amidase; Q9Y5Y7, Lymphatic vessel endothelial hyaluronic acid receptor 1.

Figure 4.2 - Boxplots of protein peak intensities significantly different between S (blue) and NS (orange) at D7 (Wilcoxon rank-sum test p < 0.05, FDR < 0.15). Distribution of differences is shown as box-plot; each plot represents a different protein: P05543, Thyroxine-binding globulin; P13796, Recombinase Flp protein; P36222, Chitinase-3-like protein 1.

4.3.3 Time trend variation of proteins and metabolites

Changes in protein levels from D1 to D7 within the same group were assessed as well: 14 proteins changed significantly from D1 to D7 in the NS group and 10 in the S group (Wilcoxon signed-rank test p-value < 0.05). Of these proteins, 9 changed significantly from D1 to D7 in both groups. The temporal trends in the two groups are reported in Table 4.3.

Differences in the ratio D7/D1 between S and NS patients for proteins and metabolites are

shown in Figure 4.3 and 4.4: 9 proteins and 5 metabolites are significantly different between the

two groups.

Figure 4.3 - Boxplot of the ratio D7/D1 of protein peak intensities significantly different between S (blue) and NS (orange) (Wilcoxon rank-sum test p < 0.05, FDR < 0.15). Distribution of differences is shown as box-plots. Each plot represents a different protein: P00746, Complement factor D; P00915, Carbonic anhydrase 1; P02649, Apolipoprotein E; P02745, Complement C1q subcomponent subunit A; P02746, Complement C1q subcomponent subunit B; P02765, Alpha-2-HS-glycoprotein; P05155, Plasma protease C1 inhibitor; P18065, Insulin-like growth factor-binding protein 2; Q9Y5Y7, Lymphatic vessel endothelial hyaluronic acid receptor 1.

        S                                                      NS
        D1                     D7                       TREND  D1                     D7                       TREND
P00751  16.704(16.203,17.223)  16.308(16.183,16.619)    ↓      17.004(16.665,17.101)  16.630(16.122,16.759) *  ↓
P01011  17.046(16.880,17.197)  16.799(16.493,17.059)    ↓      17.292(16.897,17.452)  16.509(16.228,16.994) *  ↓
P02649  15.413(14.987,15.639)  15.691(15.234,16.187)    ↑      15.222(15.077,15.757)  16.347(16.001,16.892) *  ↑
P02741  17.228(16.847,18.673)  15.921(15.482,16.165) *  ↓      18.037(17.845,18.162)  16.385(15.815,16.683) *  ↓
P02750  17.099(16.557,17.530)  16.510(15.985,16.653) *  ↓      17.445(16.468,17.906)  16.611(15.951,17.189) *  ↓
P06681  15.394(15.059,15.611)  14.951(14.627,15.195) *  ↓      15.579(15.479,15.909)  15.446(15.129,15.582) *  ↓
P07358  15.294(14.675,15.771)  14.908(14.361,15.434)    ↓      14.573(14.521,15.862)  14.434(14.298,15.415) *  ↓
P07360  15.731(15.366,16.235)  15.434(15.343,15.690) *  ↓      15.889(15.530,16.313)  15.490(15.107,15.969) *  ↓
P15169  14.701(14.506,15.917)  14.482(14.119,15.433)    ↓      14.922(14.082,15.805)  14.641(13.806,15.523) *  ↓
P18428  15.553(14.594,15.776)  14.077(13.318,14.515) *  ↓      15.485(14.620,15.728)  14.378(13.827,14.895) *  ↓
P22792  14.339(14.097,14.900)  14.112(13.789,14.686) *  ↓      14.521(14.228,14.798)  14.190(13.916,14.551) *  ↓
P25311  16.106(15.665,16.971)  17.427(16.729,17.668) *  ↑      16.466(15.381,17.172)  17.398(16.155,17.793) *  ↑
P36222  13.330(12.748,14.194)  11.798(11.524,11.972) *  ↓      14.467(14.065,14.723)  12.388(12.076,13.035) *  ↓
P49908  13.549(13.239,14.047)  14.240(14.188,14.828) *  ↑      13.163(12.914,13.674)  13.938(13.582,14.338) *  ↑
Q15582  14.452(13.352,14.611)  13.777(12.819,14.227) *  ↓      14.157(13.277,14.384)  14.079(13.147,14.143)    ↓

Table 4.3 - Comparison of protein level changes in survivors (S) and non-survivors (NS) at D1 and at D7. Significant differences between D1 and D7 are marked with * (Wilcoxon signed-rank test, p-value < 0.05). Plasma concentrations are expressed as peak intensities and shown as median (25th, 75th percentiles).

Figure 4.4- Boxplot of the ratio D7/D1 of metabolites concentrations significantly different between S (blue) and NS (orange) (Wilcoxon rank-sum test p < 0.05, FDR < 0.15). Distribution of differences are shown as box-plots.

4.3.4.1 Regression analysis for targeted metabolomics data

We used classification models with the aim of identifying the set of features most strongly associated with the target class, i.e. the non-survivors (NS). The coefficients of the models obtained from metabolite concentrations only are reported in Table 4.4. The interpretation of the coefficients in a logistic regression is not trivial. If we express the relationship as:

$$\frac{p}{1-p} = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots) \qquad (4.1)$$

where p is the probability of belonging to class 1 (NS), we can say that if the coefficient βi is positive, then an increase of feature xi is associated with an increase of the odds p/(1-p), i.e. the probability of belonging to class 1 increases, all other variables xj being equal. Conversely, if the coefficient βi is negative, an increase of feature xi is associated with a decrease of the odds, i.e. the probability of belonging to class 1 decreases. Three metabolites were selected in all models: PC aa C42:6, PC aa C36:3 and tyrosine. Figure 4.5.A shows the coefficient values of the model built according to the criterion of minimal deviance on the first 30 ranked features. All the obtained models correctly classify the observations in the testing set.
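The reading of a coefficient in terms of odds can be checked numerically. The sketch below computes the factor by which the odds p/(1-p) change when one feature increases by one unit, all other features being equal. Because the features were Z-score normalized, one unit presumably corresponds to one standard deviation of the D7/D1 ratio. The function names are hypothetical.

```python
from math import exp


def odds_multiplier(beta, delta=1.0):
    """Factor applied to the odds p/(1-p) when a feature increases by delta."""
    return exp(beta * delta)


def predicted_probability(beta0, betas, xs):
    """Probability of class 1 obtained by inverting the odds relationship."""
    odds = exp(beta0 + sum(b * x for b, x in zip(betas, xs)))
    return odds / (1.0 + odds)
```

For instance, the tyrosine coefficient 0.751 reported in Table 4.4 (30 features, minimal deviance) gives `odds_multiplier(0.751)`, i.e. the odds of belonging to the NS class roughly double for a one-unit increase, whereas a negative coefficient gives a factor below 1.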

4.3.4.2 Regression analysis for targeted metabolomics and proteomics data

We built the classification models combining metabolomics and proteomics data, as

described in the methods section. The coefficients of the models are reported in Table 4.5; Figure

4.5.B shows the coefficient values of the model built according to the criterion of minimal deviance

on the first 30 ranked features. From Table 4.5 we can notice that lysoPC a C24:0 and the protein

P02745 are selected by all models. Moreover, PC aa C36:3 and PC aa C42:6 were again selected in

these models, and their coefficients maintain the same signs as in previous ones. All the obtained

models correctly classify the observations in the testing set.

4.3.4.3 Regression analysis for targeted metabolomics, proteomics and clinical data

Finally, we built a model combining metabolomics, proteomics and clinical data as described

in the methods section. The coefficients of the models are reported in Table 4.6. Figure 4.5.C shows

the coefficient values of the model built according to the criterion of minimal deviance on the first

30 ranked features. We can notice that also in these models the protein P02745 appears among the

most important predictors. Another protein, i.e. P02790, and PC aa C34:3 were also selected by all

models. All the obtained models correctly classify the observations in the testing set.

Figure 4.5 - Coefficient values of the logistic regression models built according to the criterion of minimal deviance on the first 30 ranked features for targeted metabolomics (panel A), integration of metabolomics and proteomics (panel B) and for the integration of omics data with clinical parameters (panel C). VCP: Central Venous Pressure; PEE: Positive End-expiratory pressure; PAC: PaCO2; MAP: Mean Arterial Pressure.

                10 features           20 features           30 features
METABOLITES     min Dev    fixed λ    min Dev    fixed λ    min Dev    fixed λ
PC aa C42:6     -0.763     -0.213     -0.672     -2.083     -0.557     -0.466
PC aa C40:6     -          -          -          -          -0.380     -0.005
PC ae C42:1     -          -          -          -1.009     -          -
lysoPC a C24:0  -0.498     -0.223     -          -0.622     -0.233     -0.241
lysoPC a C20:4  -          -          -          -          -0.188     -0.025
SM OH C16:1     -          -          -          -1.137     -          -
SM C24:1        -          -          -          0.263      -0.182     -
SM C22:3        -          -          -          -          -          -0.167
SM C24:0        -          -          -          -          -          -0.030
PC ae C42:5     -          -          -          -          -0.160     -
PC aa C42:2     -1.103     -          -0.136     -          -0.149     -
PC aa C34:4     -1.333     -          -          -0.191     -          -
Met             -          -          -          -          -0.105     -0.082
PC ae C30:2     -          -          -          -0.565     -0.073     -0.173
PC aa C36:6     -          -          -          -          0.013      -
PC aa C42:5     -          -          -          -          0.063      0.271
PC aa C36:3     2.280      0.338      0.442      1.931      0.178      0.479
Pro             -          -          0.652      1.946      0.198      -
PC aa C34:3     -          -          0.262      -          0.582      -
PC aa C42:1     -          -          1.135      2.824      0.653      0.716
Tyr             3.151      0.061      0.021      1.075      0.751      0.126
PC ae C30:1     -          -          0.305      1.151      0.820      0.300
Creatinine      -          -          0.377      1.489      1.623      0.253
Performance     Dev=4.02   Dev=23.77  Dev=8.69   Dev=24.98  Dev=9.15   Dev=25.62

Table 4.4 - Coefficient values of the logistic regression models for the first 10, 20 and 30 metabolites, computed according to the two strategies (minimal deviance and estimated λ). The coefficients of the metabolites which are common to all models are in bold. The bottom row reports values of deviance of the obtained models.

                10 features           20 features           30 features
FEATURES        min Dev    fixed λ    min Dev    fixed λ    min Dev    fixed λ
P02790          -          -          -0.416     -0.227     -1.630     -0.354
lysoPC a C24:0  -1.175     -0.641     -0.993     -0.372     -1.251     -0.628
PC aa C42:6     -          -          -0.801     -0.395     -0.186     -0.579
P02745          -1.087     -0.829     -0.187     -0.289     -0.774     -0.497
P20851          -0.485     -          -          -          -          -
lysoPC a C17:0  -          -          -          -          -0.306     -0.105
SM OH C16:1     -          -0.096     -0.335     -0.050     -          -0.487
P02746          -          -          -0.235     -0.091     -0.249     -0.347
PC aa C42:2     -0.064     -          -          -          -0.044     -
PC aa C34:3     -          -          -          0.115      0.012      -
PC ae C30:1     -          -          -          -          0.017      0.346
O75882          0.389      0.239      -          0.086      0.212      0.472
Pro             -          -          0.054      -          0.238      0.169
P06276          -          -          -          -          0.240      -
P06727          -          -          -          -          -          0.245
P19823          -          -          0.701      0.357      0.256      0.913
P05543          0.093      0.301      0.011      0.108      -          0.327
PC ae C42:1     -          -          -0.286     -          0.283      -
Tyr             -          -          -          -          -          0.309
P01034          -          -          -          -          0.909      0.895
PC aa C36:3     -          -          0.628      0.343      1.184      0.888
Performance     Dev=10.6   Dev=33.89  Dev=8.66   Dev=21.52  Dev=3.96   Dev=16.21

Table 4.5 - Coefficient values of the logistic regression models for integration of metabolomics and proteomics built on the first 10, 20 and 30 features and computed according to the two strategies (minimal deviance and estimated λ). The coefficients of the features which are common to all models are in bold. The bottom row reports values of deviance of the obtained models.

10 features  20 features  30 features
FEATURES  min Dev  fixed λ  min Dev  fixed λ  min Dev  fixed λ
P02745  -1.140  -0.664  -0.410  -0.805  -0.374  -0.450
PC ae C34:3  -  -  -  -  -0.235  -
Mean Arterial Pressure  -0.580  -0.148  -0.255  -0.221  -
PC aa C34:3  -0.315  -0.325  -0.279  -0.271  -0.187  -0.027
PaCO2  -  -  -0.192  -0.539  -0.138  -0.118
P02790  -0.273  -0.306  -0.383  -0.229  -0.117  -0.135
ScvO2  -  -  -0.082  -0.067  -  -
Serum bilirubin  -  0.029  -  -
FiO2  -  -  0.137  0.291  -  0.023
O75882  -  -  -  -  0.150
P20851  -  -  -  -  0.169
PC ae C44:4  -  -  -  -  0.211  0.149
PEEP  0.738  0.299  -  0.439  0.267  0.054
Central Venous Pressure  -  -  -  -  0.372  -
Heart Rate  0.124  -  -  -  -  -
P06276  -  -  0.260  -  -  -
PC ae C44:4  -  -  0.379  0.572  -  -
Urine Output  0.414  -  0.138  0.242  -  -
Serum creatinine  -  -  0.432  0.423  -  0.033
Performance  Dev=10.22  Dev=27.08  Dev=11.65  Dev=32.83  Dev=11.75  Dev=30.04

Table 4.6 - Coefficient values of the logistic regression models for integration of omics data with clinical parameters built on the first 10, 20 and 30 features and computed according to the two strategies (minimal deviance and estimated λ). The coefficients of the features which are common to all models are in bold. The bottom row reports values of deviance of the obtained models.

4.3.4.4 Discriminant analysis

Tables 4.7, 4.8 and 4.9 report the coefficient values of the LDA models and the VIP scores of the PLS-DA models built on the first 10 and 20 ranked features according to mRMR for targeted metabolomics, for metabolomics and proteomics data, and for omics and clinical data, respectively. We could not use the entire subset of 30 features because of the small number of observations (only 17 patients). In fact, as explained in §2.4, the computation of the decision boundary requires the covariance matrix to be invertible, which is not the case here.
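The invertibility argument can be made explicit with a short rank bound. The notation below is introduced only for this illustration, and the bound assumes that all n = 17 observations and K = 2 classes enter the pooled within-class covariance matrix used by LDA:

```latex
S_W = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i \in C_k}
      (x_i-\mu_k)(x_i-\mu_k)^{\top},
\qquad
\operatorname{rank}(S_W) \le n-K = 17-2 = 15 < 30 .
```

Each class contributes at most n_k - 1 linearly independent centered vectors, so a 30 x 30 matrix built from 17 observations is necessarily singular.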

In the metabolites model, it is worth noting that PC aa C36:3, which already played an important role in the regression models, occupies the second and first positions in the VIP ranking when considering 20 and 10 features, respectively. As for the integrated model, P02745 is in the first position and lysoPC a C24:0 is in the third (20-feature model) and second (10-feature model), thus confirming the importance of these features, which had already emerged from the regression analysis. In the classification models for omics and clinical data and for metabolomics and proteomics data, we can notice that P02745 occupies the first position followed by another protein, i.e. P02790, in agreement with what arose from the regression analysis. Three-dimensional PLS-DA score plots on 20 features for the three models are shown in Figure 4.6. In all cases, the groups separate perfectly.
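For reference, a commonly used definition of the VIP score of feature j in a PLS model with A components (here A = 3) and p features is sketched below. The symbols are introduced only for this illustration, and the thesis may rely on a slightly different normalization:

```latex
\mathrm{VIP}_j =
\sqrt{\, p \,
\frac{\sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert w_a \rVert \right)^{2}}
     {\sum_{a=1}^{A} SS_a} } ,
```

where w_a is the weight vector of component a and SS_a is the sum of squares of the response explained by that component. With this definition the squared VIP scores average to 1 over the p features, which is why values above 1 are usually read as above-average importance.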

METABOLITES               VIP PLS-DA 20   VIP PLS-DA 10   LDA
PC aa C42:1               1.633           -               -
PC aa C36:3               1.400           1.516           10.290
lysoPC a C24:0            1.372           1.411           0.200
lysoPC a C17:0            1.233           -               -
PC aa C42:6               1.159           1.120           1.785
PC ae C30:1               1.137           -               -
PC aa C34:3               1.118           -               -
Tyr                       1.096           1.042           7.865
Pro                       1.058           -               -
Creatinine                0.965           -               -
PC ae C42:1               0.915           0.827           -3.442
PC ae C30:2               0.858           0.836           6.010
PC aa C42:2               0.815           0.886           -6.153
SM C24:1                  0.756           -               -
PC aa C42:5               0.753           -               -
SM OH C16:1               0.639           0.747           -13.680
PC ae C34:3               0.600           0.417           -
PC aa C36:6               0.585           -               -3.395
PC aa C34:4               0.521           0.686           1.921
PC ae C44:4               0.266           -               -

Table 4.7 – VIP scores of PLS-DA and coefficients of LDA for the metabolites models.

FEATURES                  VIP PLS-DA 20   VIP PLS-DA 10   LDA
P02745                    1.438           1.576           -4.822
PC aa C36:3               1.367           -               -
lysoPC a C24:0            1.324           1.433           -2.041
P19823                    1.282           -               -
PC aa C42:6               1.267           -               -
P02746                    1.249           -               -
P02790                    1.223           -               -
P05543                    1.053           1.092           1.627
PC aa C34:3               1.035           -               -
PC aa C42:2               0.936           1.006           0.729
PC ae C42:1               0.909           -               -
SM OH C16:1               0.876           0.782           -0.979
O75882                    0.866           0.920           0.961
P22792                    0.801           0.826           -2.583
P16070                    0.745           0.653           1.940
P20851                    0.721           0.438           -0.326
P06276                    0.629           0.706           0.692
Q14520                    0.533           -               -
PC ae C34:3               0.397           -               -
PC ae C44:4               0.232           -               -

Table 4.8– VIP scores of PLS-DA and coefficients of LDA for the integrated metabolomics and proteomics data models.

FEATURES                  VIP PLS-DA 20   VIP PLS-DA 10   LDA
P02745                    1.681           1.620           -15.017
P02790                    1.455           1.412           1.349
PC ae C44:4               1.334           -               -
PvCO2                     1.308           -               -
PEEP                      1.235           1.053           5.906
PaCO2                     1.229           -               -
PC aa C34:3               1.124           1.030           0.172
FiO2                      1.061           -               -
Serum creatinine          0.986           -               -
Urine Output              0.984           0.794           8.590
P05543                    0.896           -               -
Mean Arterial Pressure    0.847           0.825           -1.140
Serum bilirubin           0.814           0.601           6.668
P06276                    0.659           -               -
Serum lactate             0.601           0.268           -1.285
Central Venous Pressure   0.559           -               -
Heart Rate                0.552           0.789           6.036
ScvO2                     0.522           -               -
P16070                    0.515           0.919           6.023
pHa                       0.266           -               -

Table 4.9 - VIP scores of PLS-DA and coefficients of LDA for the integrated omics data and clinical parameters models.

Figure 4.6 - Three-dimensional PLS-DA score plots on 20 features for the metabolites model (panel A), integration of targeted metabolomics and proteomics (panel B) and for the integration of omics data with clinical parameters (panel C). The two groups are perfectly separated.

4.4 EXPLORATIVE ANALYSES

An explorative analysis by probabilistic graphical models was performed with the aim of highlighting dependencies among features and of verifying whether these dependencies differ between S and NS patients. To this purpose, we considered the dataset on which we built the integration model for metabolites and proteins (i.e. first 50 ranked metabolites and proteins) and built a Markov Network (MN) for S and NS patients separately. We adopted a two-step approach. First, the maximum likelihood network was found by applying the algorithm of Chow and Liu[118]. Afterwards, a forward search was performed on each triangulated graph: the algorithm repeatedly adds the edge that optimizes a selected measure until no more add-eligible edges are found. For both steps, the measure to be minimized is the Bayesian Information Criterion, as described in Chapter 2.6. The two networks obtained for S and NS patients are shown in Figure 4.7.
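The first step of the two-step approach can be sketched as a maximum-weight spanning tree over pairwise mutual information, which is the core of the Chow and Liu algorithm. The sketch below is a simplified illustration, not the procedure used in the thesis: it assumes jointly Gaussian variables (so that mutual information follows from the Pearson correlation), it uses plain mutual information rather than a BIC-penalized score, and it omits the subsequent forward search. All function names are hypothetical.

```python
from math import log, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gaussian_mi(x, y):
    """Mutual information of two jointly Gaussian variables."""
    r2 = min(pearson(x, y) ** 2, 1 - 1e-12)  # avoid log(0) when |r| = 1
    return -0.5 * log(1.0 - r2)


def chow_liu_tree(columns):
    """Maximum-weight spanning tree over pairwise mutual information."""
    p = len(columns)
    edges = sorted(
        ((gaussian_mi(columns[i], columns[j]), i, j)
         for i in range(p) for j in range(i + 1, p)),
        reverse=True,
    )
    parent = list(range(p))  # union-find forest for cycle detection

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Each entry of `columns` holds the values of one feature across the patients of a group; the function returns the tree edges as pairs of column indices, strongest dependence first.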

Since it may be difficult to capture meaningful dependencies among so many features, we isolated the so-called “hubs”, i.e. nodes with several direct neighbors (colored in the figure). Because of their high connectivity, we speculate that these nodes could have an important role in the network. There is also a high number of “leaves”, i.e. vertices having only one edge. By comparing the two networks, we can notice that they have different structures and different hubs. More precisely, the S group network has one hub connected with six direct neighbors, whereas the NS group network has one hub connected with seven direct neighbors. The number of leaves was 17 in the R model (34%) and 14 in the other (28%).
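Hubs and leaves can be read directly from the node degrees of a network. The sketch below counts them from an edge list; taking the hubs to be the nodes of maximal degree is an assumption made for the example, and the function name is hypothetical.

```python
from collections import Counter


def degree_summary(edges):
    """Return (hubs, maximal degree, leaves) for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_deg = max(degree.values())
    hubs = sorted(n for n, d in degree.items() if d == max_deg)
    leaves = sorted(n for n, d in degree.items() if d == 1)
    return hubs, max_deg, leaves
```

The percentage of leaves quoted in the text corresponds to the number of leaves divided by the total number of nodes in the network.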

To better understand the network structure, we concentrated on the neighborhood of the

different hubs up to the second node. The hub is constituted by P19823 in S patients, and by

methionine (Met) in NS patients, as shown in Figure 4.8.

Figure 4.7- MN model of metabolites concentration in S (left panel) and NS (right panel) patients, highlighting the hubs (1 red for S and 1 blue for NS).

Figure 4.8- MN model of metabolites concentration in S (left panel) and NS (right panel) patients, highlighting the hubs (1 red for S and 1 blue for NS).

4.5 DISCUSSION

We performed a feature reduction to select the variables to enter the classification models,

which were again built using different techniques (regularized logistic regression and discriminant

analysis).

Our results are in line with the previous results (Chapter 3) and confirm the involvement of

lysoPCs and PCs in septic shock progression, which indicates an overall lipidome alteration in NS

patients. The novelty of the present study is the integration with the proteomics analysis. In

particular, the proteins significantly different between the two groups are involved in the pathways

of coagulation, innate immunity and inflammatory response (see Table B.1 in Appendix B for a

complete list). We focused in particular on one protein, P02745, i.e. complement C1q

subcomponent subunit A, whose peak intensities were significant both in the univariate and

multivariate analysis (see Figure 4.1, 4.3 and 4.5). Complement C1q subcomponent subunit A

associates with the proenzymes C1r and C1s to yield C1, the first component of the serum

complement system, as shown in Figure 4.9.

Figure 4.9 – Scheme of the complement protein C1 and of its subcomponents. Efficient activation of C1 takes place on interaction of the globular heads of C1q with the tail of IgG or IgM antibody present in immune complexes.

The complement system is a part of the immune system, which enhances the ability of

antibodies and phagocytic cells to clear microbes and damaged cells and promotes inflammation. It

consists of several small proteins circulating in the blood as inactive precursors. After stimulation,

specific proteases cleave proteins to release cytokines and initiate an amplifying cascade of further

cleavages. The end result of this complement activation cascade is stimulation of phagocytes to

clear foreign and damaged material, promotion of inflammation to attract additional phagocytes,

and activation of the cell-killing membrane attack complex[119].

Even if the role of protein P02745 in sepsis has not been completely elucidated yet[120], the involvement of the coagulation and complement systems in sepsis is well known, as already illustrated in paragraph 1.2. Therefore, to confirm the mass spectrometry data and to further strengthen this finding, a validation using an antibody-based method (ELISA) was performed to obtain a quantitative measure of the protein P02745. The concentration values measured in the two groups were compared by Wilcoxon rank-sum test at D1 and D7. A significant difference between S and NS was found only at D1, in line with what was already found for the data expressed as peak intensities. As reported in Figure 4.10, the trend is also the same: the protein concentration is higher in NS patients than in S ones.

Figure 4.10 – Boxplot showing P02745 concentration measured by ELISA (left) and by mass spectrometry (right). In both cases the difference is significant (Wilcoxon rank-sum test p-value<0.05) and the protein is more abundant in NS patients.

4.6 REMARKS

In conclusion, our results confirm the feasibility of our data mining approach for the analyses

of proteomics data and for the integration with the metabolomic ones. With respect to our previous

analyses on metabolomic data only (see Chapter 3), the integration with proteomics seems to

indicate the importance of the interaction between inflammation, coagulation and the complement

system in sepsis, which is in line with the recent findings[12].

This aspect is very interesting, since it is an example of how data integration can better elucidate the several pathways involved in septic shock, thus providing a more complete view of disease progression. Although further analyses are needed, this may constitute an important step toward the identification of the molecular mechanisms that could be the target of new therapies.

As for the integration with clinical data, the models are difficult to interpret. From the regression analysis, it may seem that the clinical parameters are more relevant than the omics ones, but we argue that this could be due to the fact that we consider a quite long time interval (7 days). However, the protein P02745 still has the highest weight in the model. More focused investigations are needed to better elucidate the interplay between clinical parameters and omics data in order to better characterize the patient’s profile.

5 CHARACTERIZATION OF A METABOLOMIC PROFILE ASSOCIATED

WITH RESPONSIVENESS TO THERAPY IN THE ACUTE PHASE OF

SEPTIC SHOCK

Elucidation of early metabolic signatures associated with the progression of septic shock and with responsiveness to therapy can be useful in the development of a targeted therapy. In this study, we examined the plasma metabolome of 21 septic shock patients enrolled in the ShockOmics clinical trial (NCT02141607). Part of this work was submitted as an abstract to the 40th Annual Conference on Shock¹ and as a journal paper to Scientific Reports². Our aim was to verify whether different responses to therapy, assessed as the change in the Sequential Organ Failure Assessment (SOFA) score measured at admission (T1, acute phase) and 48 hours later (T2, post-resuscitation), are associated with different trends in metabolite patterns. To this purpose, we combined untargeted and targeted mass spectrometry-based metabolomics strategies to cover as much as possible of the plasma metabolite repertoire. Changes in metabolite concentrations from T1 to T2 (expressed as Δ = T2-T1) were used to build classification models. Our results support the emerging evidence that lipidome alteration plays an important role in the individual patient response to infection. Understanding the regulatory pathways of lipids is thus important for the development of an effective and tailored therapy. Furthermore, alanine indicates a possible alteration in the glucose-alanine cycle, which occurs in the liver, thus providing a different picture of liver functionality than bilirubin.

Blood samples were analyzed at the laboratory of Mass Spectrometry at IRCCS Mario Negri

Institute in Milan under the supervision of Dr. Roberta Pastorelli.

We will present the rationale behind the study, the study design, the dataset and the methods applied. Afterward, the obtained results will be compared and discussed.

¹ A. CAMBIAGHI, B. Bollen Pinto, L. Brunelli, F. Falcetta, K. Bendjelid, F. Aletti, R. Pastorelli, M. Ferrario, “Responsiveness to therapy in the acute phase of septic shock: a metabolomics analysis”, 40th Annual Conference on Shock, Fort Lauderdale, Florida, June 9-12 2017 (submitted).
² A. CAMBIAGHI, B. Bollen Pinto, L. Brunelli, F. Falcetta, K. Bendjelid, F. Aletti, R. Pastorelli, M. Ferrario, “Characterization of a metabolomic profile associated with responsiveness to therapy in the acute phase of septic shock”, Scientific Reports (submitted).

5. Metabolomics profile associated to responsiveness to therapy

5.1 INTRODUCTION

The most important phase in critically ill patients, such as septic shock patients, is the initial

one, i.e. shock diagnosis and beginning of treatment administration. In fact, early supportive therapy

with fluid resuscitation and vasopressors to restore hemodynamics and decrease tissue

hypoperfusion is decisive on the patient’s outcome and has been part of treatment guidelines for

decades[27]. However, mortality rates for septic shock may reach 60% in the era of early recognition

and treatment[9], with the present poor prognosis being mainly related to multiple organ

failure (MOF). The modest improvement in septic shock survival can be explained by the inability

to prospectively identify the patients who are most likely to benefit from a specific therapy and by the

absence of markers able to monitor and predict drug delivery and response. The research community

is becoming increasingly aware that the individual response to therapy is important, and precision

medicine is becoming an important research topic for acute illness and septic

shock as well[121]. Precision medicine extends personalized medicine beyond the genome to include

a broader, systems-level, multilevel approach to tailoring therapeutics to individual patients.

Recently, interest in metabolomic approaches has been increasing, as the metabolome

represents the end result of gene and protein function and activity and therefore may provide a

more sensitive readout of drug response phenotypes because most drugs impact components of

metabolism[122]. Several studies used metabolomics analyses of various classes of blood

metabolites in search of predictive signatures of intensive care unit (ICU) mortality in adult

patients[98], [123], [124], but less attention has been given to the investigation of putative

metabolic determinants able to classify patient responsiveness to initial therapy during the first 48

hours in ICU.

To the best of our knowledge, this is the first study that investigates a population of septic

shock patients during the acute phase. We examined the plasma metabolomics profile of septic

shock patients during the acute phase of resuscitation. Blood samples were collected at study

enrolment (time T1) and after about 48 hours (time T2). The patients received initial therapy

according to the standards[27] immediately after shock diagnosis (time T0). The time interval

between T0 and T1 was on average 10 hours.

We merged untargeted and targeted mass spectrometry-based metabolomics strategies to

cover as much of the plasma metabolite repertoire as possible. We first adopted an unbiased strategy

(untargeted metabolomics) to profile as many plasma metabolites as possible without any

a priori hypothesis. To this end, rapid yet accurate metabolic mass profiling performed

by direct flow injection-TOF-MS[125] was applied as an untargeted screening to explore the main

perturbed metabolic features.

Targeted metabolomics, by contrast, is a method in which a specified list of metabolites is

measured and quantified according to a standard in order to achieve absolute quantification of

defined metabolite classes[126]. Since metabolic signatures showing alteration in circulating

kynurenine, fatty acids, lysophosphatidylcholine species and/or carnitine esters have already been

reported in different settings of septic shock patients[89],[123],[124],[127], and in our previous

analyses (see Chapters 3 and 4), we hypothesized that they might also be involved in the first

phase of shock, and that they could help in understanding the different trajectories of septic shock patients.

Consequently, we also applied a targeted approach focused on the measurement of these specific

metabolic classes to quantify the magnitude of their changes in our clinical setting and

possibly validate the information obtained by our untargeted analysis.

The primary objective of these analyses was to verify whether a different response to

therapy, measured as changes in organ dysfunction, i.e. Sequential Organ Failure Assessment

(SOFA) score, is associated with a different trend in metabolite patterns. Our aim is thus to provide a

thorough description of the possible biological pathways which characterize this population, so as to

suggest putative biomarkers to be investigated further.

The current work is an ancillary study from the multicenter prospective observational trial

named ShockOmics (ClinicalTrials.gov Identifier NCT02141607). Details of the protocol are fully

described in the work of Aletti et al.[64]. Between October 2014 and December 2015, patients

admitted with septic shock to the ICU of Geneva University Hospitals were screened for inclusion.

Adult (>18 years old) patients with an admission SOFA score ≥6 and arterial lactate levels ≥2 mmol/L

were enrolled. Patients with a high risk of death within the first 24 hours after admission, systemic

immunosuppression, hematological diseases, metastatic cancer, pre-existing dialysis,

decompensated cirrhosis or patients that had received more than 4 units of red blood cells or any

fresh frozen plasma before ICU admission were excluded. Informed consent was obtained from

patients or proxies. Patient management was performed by the clinical care team according to

international guidelines[27].

For each patient, plasma samples were available at the time point named T1 (i.e. the acute

phase, within 16 hours after ICU admission or development of shock) and at the time point named

T2 (i.e. the post treatment phase). In this study, we analyzed blood samples from 21 patients,

available both for untargeted and targeted metabolomics analyses. Patients were classified into two

groups according to their responsiveness to therapy. All the patients had a SOFA score higher than

9 at T1; patients who still had a SOFA score higher than 8 at T2 and did not show a decrease of at

least 4 points were classified as not responsive to therapy (NR), whereas all the others were

classified as responsive to therapy (R). In other words, the NR group consists of 7 patients, who had

a SOFA score > 8 at T2 and ΔSOFA < 5 (Δ = T1-T2 values of SOFA).
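The grouping rule can be illustrated with a minimal Python sketch. It assumes the explicit thresholds given in the last sentence above (SOFA at T2 > 8 and ΔSOFA = T1-T2 < 5 for NR); the function name and the example patients are hypothetical and are not study data.

```python
def classify_response(sofa_t1, sofa_t2):
    """Return 'NR' (not responsive) or 'R' (responsive) from two SOFA scores.

    Assumed rule, taken from the thresholds stated in the text:
    NR if SOFA at T2 > 8 and delta SOFA (T1 - T2) < 5, otherwise R.
    """
    delta_sofa = sofa_t1 - sofa_t2
    if sofa_t2 > 8 and delta_sofa < 5:
        return "NR"
    return "R"


# Hypothetical patients (illustrative values only): id -> (SOFA at T1, SOFA at T2).
patients = {"p01": (12, 11), "p02": (14, 6), "p03": (15, 9)}
labels = {pid: classify_response(t1, t2) for pid, (t1, t2) in patients.items()}
```

In this sketch a patient with a SOFA score still above 8 at T2 is labelled R only when the score has dropped by 5 points or more.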

5.2.2 Statistical analysis

5.2.2.1 Data from untargeted metabolomics

An untargeted analysis by flow injection-TOF-MS was performed to screen for metabolic

features significantly characterizing the responsiveness (R group) and non-responsiveness (NR

group) to therapy in septic shock patients. Technical details can be found in Appendix A. A total of

14001 and 2190 metabolite masses were measured as peak intensities in positive and negative ion

mode respectively. Given the high number of masses measured, we performed some preliminary

statistical analyses in order to select only the most significant ones for subsequent metabolite

identification.

First, we tested the presence or absence of the species in the two groups, i.e. whether the

incidence of the peaks at each mass-to-charge ratio (m/z) differed between the groups at T1 and at T2.

We constructed contingency tables for each m/z by counting the number of patients in the R and NR

groups in whom the ion was detected (i.e. above the limit of detection) and we applied Fisher's exact

test, considering the data at T1 and T2 separately. We also constructed contingency tables for

each m/z by counting the number of patients in whom the ion was detected at T1 and at T2, and we

applied the McNemar test to assess whether the incidence of detected masses changed from T1 to T2.

In positive ion mode, 63 masses at T1 and 172 at T2 had a significantly different incidence

(p-value<0.05) between R and NR, whereas in negative ion mode 8 masses at T1 and 20 at T2 did.

The McNemar test was significant (p-value<0.05) for 653 and 119 masses in positive and negative mode

respectively (results not shown).
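The two incidence tests can be sketched for a single m/z value as follows. This is an illustrative sketch using only the Python standard library, not the code used in the study: the function names are hypothetical, the Fisher p-value is the usual two-sided one, and the exact (binomial) form of the McNemar test is an assumption, since the text does not state which variant was applied.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: group (R, NR); columns: ion detected / not detected.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x detections in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def mcnemar_exact(n01, n10):
    """Exact two-sided McNemar p-value from the two discordant counts.

    n01: detected at T2 only; n10: detected at T1 only.
    """
    n = n01 + n10
    if n == 0:
        return 1.0
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: ion detected in 14/14 R patients and 0/7 NR patients.
p_incidence = fisher_exact_two_sided(14, 0, 0, 7)
# Hypothetical counts: ion lost in 6 patients from T1 to T2, gained in none.
p_change = mcnemar_exact(0, 6)
```

In practice each function would be called once per m/z value, and the resulting p-values collected for the multiple-comparison step.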

As a second step, we compared the peak intensity distributions. Unpaired and paired

univariate analyses were performed by means of the Wilcoxon rank-sum test and the Wilcoxon signed-rank

test, respectively.

To overcome the problem of the large number of statistical comparisons, the

false discovery rate (FDR) was computed in all analyses from the p-values obtained from the tests.

Results were considered statistically significant when p-value <0.05 and FDR <0.15. For the

univariate analysis, only masses for which peaks were detected in more than 5 patients in the R group and in

more than 3 patients in the NR group were considered. In positive ion mode, 25 and 79 masses were significantly different between R

and NR at T1 and T2, respectively; in negative ion mode, 10 and 19 masses

(Wilcoxon rank-sum test, p<0.05). As for the paired analysis (T1 vs T2 within the same group), in

positive ion mode, 119 and 48 masses changed significantly from T1 to T2 in R and NR, respectively;

in negative ion mode, 50 and 41 did. This preliminary analysis was used for the selection of the

metabolites to be identified (Appendix A).
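The double threshold (p-value < 0.05 and FDR < 0.15) can be sketched as below. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption; function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def select_significant(p_values, p_cut=0.05, fdr_cut=0.15):
    """Indices passing both thresholds quoted in the text (p < 0.05, FDR < 0.15)."""
    q_values = benjamini_hochberg(p_values)
    return [
        i
        for i, (p, q) in enumerate(zip(p_values, q_values))
        if p < p_cut and q < fdr_cut
    ]


# Hypothetical p-values for four masses.
example_p = [0.01, 0.04, 0.03, 0.5]
example_q = benjamini_hochberg(example_p)
selected = select_significant(example_p)
```

With thousands of masses tested, the FDR condition becomes much more restrictive than in this four-value example.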

For the identified metabolites, we evaluated the ability of each metabolite individually to separate

the two groups by computing the area under the ROC curve (AUC) with leave-one-out

cross-validation (Figure 5.1). The species identified by untargeted metabolomics and also quantified

by the targeted approach were compared by means of Spearman correlation analysis, in order to verify whether their peak intensities and their

concentrations were correlated (Figure 5.2).
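A single-metabolite AUC with leave-one-out resampling can be sketched as follows. The text does not detail how leave-one-out was combined with the AUC, so this sketch adopts one possible reading: the AUC is recomputed with each patient removed in turn and then averaged. Function names and data are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def leave_one_out_auc(values, labels):
    """Average AUC over all datasets obtained by leaving one patient out.

    values: one feature per patient (e.g. delta of a metabolite);
    labels: 1 for NR, 0 for R. Each class needs at least two patients.
    """
    aucs = []
    for left_out in range(len(values)):
        kept = [(v, y) for i, (v, y) in enumerate(zip(values, labels)) if i != left_out]
        pos = [v for v, y in kept if y == 1]
        neg = [v for v, y in kept if y == 0]
        aucs.append(auc(pos, neg))
    return sum(aucs) / len(aucs)


# Hypothetical deltas for six patients, perfectly separated by group.
demo_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
demo_labels = [0, 0, 0, 1, 1, 1]
demo_auc = leave_one_out_auc(demo_values, demo_labels)
```

An AUC close to 0.5 indicates no separation between R and NR, while values approaching 1 indicate that the metabolite alone discriminates the groups.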

Figure 5.1 – AUC analyses for untargeted metabolomics. The ability of the delta in peak intensity of each identified species to separate the two groups individually was evaluated by computing the area under the ROC curve using the leave-one-out cross-validation (CV) technique. Notice that the performance in classifying the two groups is poor: the average AUC of each metabolite is below 0.8, with the only exception of creatinine.

Figure 5.2 – Spearman correlation between concentrations and peak intensities of the same species quantified with both approaches (targeted and untargeted respectively). Notice that all species show a strong, significant correlation (ρ > 0.85 and p-value < 10⁻⁵).

5.2.3 Data from targeted metabolomics analysis

A targeted quantitative approach using a combined direct flow injection and liquid

chromatography (LC) tandem mass spectrometry (MS/MS) assay was applied for targeted

metabolomics analysis (see Appendix A for technical details). We compared the metabolite

concentrations of the R and NR groups at T1 and at T2 by means of the Wilcoxon rank-sum test. The

variations in metabolite concentrations from T1 to T2 were assessed separately for the R and NR

groups by means of the Wilcoxon signed-rank test. Finally, for each metabolite, the time-trend variations

in metabolite concentration (i.e. ∆=T2-T1) were compared between the two groups by the Wilcoxon

rank-sum test. To overcome the problem of the large number of statistical comparisons, we

also computed the false discovery rate (FDR). The FDR was assessed after a bootstrapping

procedure: the sample size was increased from 14 to 20 subjects for the R group and from 7 to 10

subjects for the NR group by bootstrapping with replacement, for a total of 30 observations. The bootstrapping

procedure was used only for the FDR assessment. Results were considered statistically significant

when p-value <0.05 (no bootstrapping) and FDR <0.15. For the metabolite concentrations as well, we

evaluated the ability of each metabolite individually to separate the two groups by computing the

area under the ROC curve with leave-one-out cross-validation (Figure 5.3).
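The resampling step used before the FDR assessment can be illustrated with a short sketch. It assumes one reading of the procedure, namely that each group is redrawn with replacement up to the stated size (20 for R, 10 for NR); the text could also be read as keeping the original observations and adding resampled ones. The helper name, the seed and the placeholder values are hypothetical.

```python
import random


def bootstrap_group(samples, target_size, rng):
    """Draw target_size observations from samples with replacement."""
    return rng.choices(samples, k=target_size)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible

# Placeholder deltas for one metabolite (random numbers, not study data).
delta_r = [rng.gauss(0.0, 1.0) for _ in range(14)]   # R group, 14 patients
delta_nr = [rng.gauss(0.5, 1.0) for _ in range(7)]   # NR group, 7 patients

boot_r = bootstrap_group(delta_r, 20, rng)    # 14 -> 20 observations
boot_nr = bootstrap_group(delta_nr, 10, rng)  # 7 -> 10 observations
```

The enlarged groups would then be passed to the same univariate tests, and the resulting p-values used only for the FDR computation.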

Figure 5.3 – AUC analyses for targeted metabolomics. Only the best 30 metabolites are shown (AUC > 0.5). The ability of the delta of each metabolite to separate the two groups individually was evaluated by computing the area under the ROC curve using the leave-one-out cross-validation (CV) technique. Notice that the performance in classifying the two groups is poor: the average AUC of each metabolite is below 0.8.

5.2.4 Multivariate analyses

5.2.4.1 Data from targeted metabolomics analysis

Our aim was to classify NR patients. The classification models were built on the changes in metabolite

concentrations from T1 to T2, expressed as Δ = T2-T1. Since metabolite concentrations are

highly correlated and the number of observations (21 patients) is much lower than the number of

features (130 metabolites), it was necessary to perform feature reduction before building the model.

We adopted the mRMR algorithm, as previously described, and we discretized the feature

distributions according to the interquartile range. Multivariate analysis was performed similarly to

what was already presented in Chapter 4. Briefly, we considered the first 10, 20 and 30 ranked

metabolites to build three different classification models. The dataset was divided into a training

set and a test set, comprising two thirds and one third of the observations, respectively.

Data were normalized (Z-score normalization) before performing the elastic net.
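The data preparation described above can be sketched as follows. This is a minimal standard-library illustration, not the pipeline used in the study: the split is a plain random split (the text does not say whether it was stratified by group), and estimating the Z-score parameters on the training set only is a common precaution assumed here rather than something the text states. All names are hypothetical.

```python
import random
from statistics import mean, stdev


def split_train_test(n_obs, train_fraction=2 / 3, seed=0):
    """Shuffle observation indices and split them into train and test lists."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_train = round(n_obs * train_fraction)
    return indices[:n_train], indices[n_train:]


def zscore_fit(column):
    """Return (mean, sample standard deviation) estimated on training values."""
    return mean(column), stdev(column)


def zscore_apply(column, mu, sigma):
    """Standardize values with previously estimated parameters."""
    return [(x - mu) / sigma for x in column]


# 21 patients -> 14 for training, 7 for testing.
train_idx, test_idx = split_train_test(21)

# Hypothetical feature column (delta of one metabolite for three patients).
mu, sigma = zscore_fit([1.0, 2.0, 3.0])
standardized = zscore_apply([1.0, 2.0, 3.0], mu, sigma)
```

Applying the training-set mean and standard deviation to the test set keeps information from the held-out patients out of the model fitting step.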

We adopted two strategies to further select a smaller subset of features. We fitted an elastic net

logistic model, using a logit function, to the training set data 50 times. We considered a

binary classification (R = 0, NR = 1) and the output of the model is a value between 0 and 1, which