
Prediction of Treatment Response in Patients with Multiple Myeloma undergoing Chemotherapy using MRI derived Imaging Biomarkers

Renata Isabel Jónatas Quintino

Thesis to obtain the Master of Science Degree in

Biological Engineering

Supervisor(s): Dr. Nickolas Papanikolaou

Prof. Susana de Almeida Mendes Vinga Martins

Examination Committee

Chairperson: Prof. Maria Margarida Fonseca Rodrigues Diogo

Supervisor: Dr. Nickolas Papanikolaou

Member of the Committee: Dr. Vasilios Koutoulidis

November 2019

The work presented in this thesis was performed at the Computational Clinical Imaging Group of the Champalimaud Foundation (Lisbon, Portugal), during the period February-October 2019, under the supervision of Dr. Nickolas Papanikolaou. The thesis was co-supervised at Instituto Superior Técnico by Prof. Susana Vinga.

Furthermore, I declare that this document is an original work of my own authorship and that it fulfills

all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.


Acknowledgments

There are three people whom I must thank enormously for all their patience throughout my ups and downs during these challenging times in a student’s life. These three people are always ready to lift me up and keep me grounded, to listen to all my problems and offer helpful advice or just a much-needed shoulder to lean on. They celebrate all my achievements with such pride and joy, and this one, the end of an era, is dedicated to them. To my lovely mom, my wise dad and my goofy brother.

At the Champalimaud Foundation I had my fair share of helping hands: my thesis advisor Dr. Nickolas Papanikolaou, PhD Joao Santinha and PhD Jose Moreira, a well-deserved thank you. A kind word of thanks also goes to the rest of the team.

To Professor Ana Azevedo and to PhD Mariana Ferreira, who with incredible kindness guided me to Professor Susana Vinga, who kindly agreed to be my thesis advisor. To these three amazing ladies, a big thank you.

To the institutions, namely Escola Secundaria Mouzinho da Silveira and Tecnico Lisboa, that gave me the tools to come so far along in this journey, and to all the professors who had an impact on me and helped me become the student I am today.

An enormous thank you to an amazing couple who have always been available to help me inside and outside of the Champalimaud Foundation, Graca and Paulo. Without you, I would probably not have ended up doing this thesis.

To my friends back home, who accompanied me throughout this journey and always gave me a

sense of home even when physically far.

To my friends in Biological Engineering and in life who accompanied me throughout these relentless and amazing five years at Tecnico, a big thank you. To the ones I share great memories with, namely Carolina Richheimer, Isabel Doutor, Margarida Rodrigues, Pedro Pereira, Sofia Amorim, Simone Gorny and Tiago Taborda.

To my Palazzo Lombardia friends, whom I will never forget after being a huge part of one of the most amazing experiences of my life, which made me grow up so much. A big thank you to Ana Bordignon, Anna Rita Carvalho, Amanda Coelho, Amanda Queirante, Botond Gazda, Carlos Maranghetti, Come de Tugny, Daniela Oliveira, Gianluigi Quaglia, Jan Witte, Lauren Astruc, Leen Leconte, Marta Lo Presti, Max Komorek, Sophie Vermeire and Unnie Marie Tvedt.

Last but not least, a gigantic thank you to my big C1 family. Living with so many people is not easy at times, but all these amazing human beings, with whom I have the pleasure of sharing a roof, were the omnipresent support that I am truly thankful for, throughout my struggles and throughout my conquests. These people contributed to my happiness over five wonderful years, and all of you have a very special place in my heart. A special thank you to Alice Lourenco, Beatriz Filipe, Afonso Luz, Carlos Pires, Diogo Pires, Goncalo Cardoso, Iara Figueiras, Ines Rainho, Leonardo Pedroso, Joao Nunes, Joao Pedro Gomes, Maria Mesquita, Matheus Orsi, Miguel Rebocho, Pedro Pereira, Sara Cardoso, Solange Bolas, Steven Santos and Tiago Costa.


Resumo

Imaging techniques are increasingly being used in the assessment of multiple myeloma (MM). The main objective of this work is to use magnetic resonance images to discover an accurate biomarker that can aid in predicting the response to treatment in patients with MM.

Magnetic resonance images were collected from 30 patients with MM, before and after the first cycle of chemotherapy treatment. First-order statistics describing the distribution of signal intensity were extracted from the apparent diffusion coefficient and fat fraction maps generated from the diffusion-weighted images and the gradient echo sequences, respectively.

These variables were subjected to a univariate analysis using the Mann-Whitney U-Test to assess statistically significant differences between the two patient populations, which differ in their response to treatment. These results were subjected to multiple comparison corrections. In parallel, the ROC (receiver operating characteristic) curves were analysed to identify the attributes showing the best trade-off between sensitivity and specificity.

Several variables showed good discriminatory power between the two populations, as well as good values for the performance measures. The best results are extracted from the maps collected before treatment, with a sensitivity of 66.7% and a specificity of 90.9%, or with a sensitivity of 83.3% and a specificity of 81.8%.

This work contributes to establishing the potential of variables collected from magnetic resonance images as accurate biomarkers for predicting the response to treatment in MM.

Keywords: magnetic resonance imaging, fat fraction, apparent diffusion coefficient, multiple myeloma, biomarkers, treatment response.


Abstract

Imaging techniques are being increasingly used in the evaluation of multiple myeloma (MM). The main

objective of this work is to explore magnetic resonance images to discover accurate imaging biomarkers

that can aid in an early prediction of response to treatment in patients with MM.

Magnetic resonance images from the spine of 30 patients with MM were collected before and after the first cycle of induction chemotherapy. First-order statistics that describe the distribution of signal intensity were extracted from apparent diffusion coefficient (ADC) and fat fraction (FF) maps, generated from diffusion-weighted imaging and in- and opposed-phase gradient echo magnetic resonance images, respectively.

These imaging features were subjected to a univariate analysis with a Mann-Whitney U-Test to evaluate statistically significant differences between the two populations of patients, which differ in response to treatment (responders and non-responders), in three different scenarios. These results were subsequently subjected to multiple comparison corrections. In parallel, in order to identify the attributes that displayed the best balance between sensitivity and specificity, the receiver operating characteristic (ROC) curves were analysed.

Several features demonstrated good discrimination between responders and non-responders, as well as good performance metrics. The best attributes were extracted from the maps created before the beginning of induction treatment, with specificity and sensitivity equal to or greater than 81.8% and 66.7%, respectively.

This work consolidates the potential of imaging features collected from magnetic resonance images as accurate biomarkers in the prediction of treatment response in MM.

Keywords: magnetic resonance imaging, apparent diffusion coefficient, fat fraction, multiple

myeloma, imaging biomarkers, treatment response.


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
List of acronyms and abbreviations
Glossary

1 Introduction
1.1 Motivation
1.2 Topic Overview
1.3 Objectives
1.4 Thesis Outline

2 Theoretical Background
2.1 Image Acquisition
2.1.1 T1-weighted
2.1.2 Short-Time Inversion Recovery
2.1.3 In and Opposed Phase Gradient Fast Field Echo
2.1.4 Diffusion-Weighted Imaging
2.2 Pre-processing
2.3 Segmentation
2.4 Feature Extraction
2.5 Clinical Variables
2.6 Statistical Analysis Concepts
2.6.1 Univariate Analysis
2.6.2 Performance Metrics

3 Materials and Methods
3.1 Patient Population
3.2 MRI Techniques
3.3 Statistical Analysis Implementation
3.3.1 Data sets
3.3.2 Data Preparation
3.3.3 Univariate Analysis

4 Results
4.1 Univariate Analysis
4.1.1 p-value and ROC Curve Evaluation for the First Order Imaging Features
4.1.2 Detailed Analysis of the Key Attributes
4.1.3 Clinical Variables
4.2 Density Plots Comparative Analysis Over the First Round of Chemotherapy

5 Discussion

6 Conclusions

Bibliography

A Additional Informations
A.1 IMWG Guidelines

B Data Sets

C Materials and Methods
C.1 R code
C.1.1 Adjusted p-values
C.1.2 Generation of the ROC curves

D Results
D.1 ROC Curves Analysis
D.2 ROC Curves
D.3 Box and Whiskers Plots

List of Tables

2.1 General form of a confusion matrix.

4.1 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 1.

[...] post-1stcycle data set in scenario 1.

4.7 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 1.

4.10 Summary of the attributes considered statistically relevant through the p-value evaluation and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.

4.11 Mean associated with the key attributes for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each key attribute is located.

[...] attribute of the clinical data set in scenario 1.

4.15 Summary of the attributes considered statistically relevant through the Mann-Whitney U-Test and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC score with the respective 95% confidence interval and p-values before and after the multiple comparison tests.

4.16 Mean associated with the clinical attributes LDH and serum immunoglobulin A (Serum IgA) for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each attribute is located.

4.17 Summary of the statistical metrics that characterize the density plots depicted in Figures 4.4 to 4.9.

A.1 International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part I] [50]

A.1 International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part II] [50]

B.1 Response to the induction therapy.

D.1 Summary of the attributes found statistically interesting when considering the AUC analysis.

D.2 Statistical parameters concerning the design of the Box and Whiskers plots for the key attributes and clinical variables. The classes are identified as responders and non-responders, corresponding to the class of patients that are considered to respond and not respond to treatment, respectively.

List of Figures

1.1 MRI image (in-phase gradient echo) from a patient’s spine. A Original image and B Segmented image with the regions of interest (spine’s vertebrae) filled with a red label.

3.1 General process used for the collection of the p-values in the univariate analysis.

4.1 Comparison of differences in signal intensity parameters collected from ADC and FF maps between responders and non-responders for the key attribute A Kurtosis in the ADC pre-treatment data set in scenario 1 (ADC PreTreat 1), B 90 Percentile in the FF pre-treatment data set in scenario 1 (FF PreTreat 1), C Median in the FF pre-treatment data set in scenario 1, D Skewness in the FF pre-treatment data set in scenario 1, E Root Mean Squared in the FF pre-treatment data set in scenario 1 and F Total Energy in the FF pre-treatment data set in scenario 1.

4.2 Box and whisker plot for the clinical attributes A LDH and B Serum Immunoglobulin A (Serum IgA) in scenario 1.

4.3 Box and whisker plot for the clinical attribute Serum Immunoglobulin A (Serum IgA) in scenario 1, amplified.

4.4 Density plot relative to the signal intensities extracted from the ADC maps from patient 2, who presents a response classification of 6 to the induction therapy.

4.5 Density plot relative to the signal intensities extracted from the FF maps from patient 2, [...]

4.6 Density plot relative to the signal intensities extracted from the ADC maps from patient 6, [...]

4.8 Density plot relative to the signal intensities extracted from the ADC maps from patient 22, who presents a response classification of 1 to the induction therapy.

D.1 ROC curve for the attribute Kurtosis in the ADC pre-treatment data set in scenario 1, with a corresponding AUC value of 0.855 (0.679-1.000).

D.2 ROC curve for the attribute 90 Percentile in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.879 (0.747-1.000).

D.3 ROC curve for the attribute Median in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.698-1.000).

D.4 ROC curve for the attribute Root Mean Squares in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.704-1.000).

D.5 ROC curve for the attribute Skewness in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.702-1.000).

D.6 ROC curve for the attribute Total Energy in the FF post-treatment data set in scenario 1, with a corresponding AUC value of 0.864 (0.703-1.000).

List of acronyms and abbreviations

ROC Receiver Operating Characteristic
α Significance Level
β2M Beta-2 Microglobulin
ADC Apparent Diffusion Coefficient
AUC Area Under the Curve
BH Benjamini-Hochberg
CR Complete Response
DWI Diffusion Weighted Imaging
ECOGPS Eastern Cooperative Oncology Group Performance Status
FDR False Discovery Rate
FFE Fast Field Echo
FF Fat Fraction
FLC Free Light Chain
FN False Negative
FPR False Positive Rate
FP False Positive
FWER Family-Wise Error Rate
GRE Gradient Echo
Hb Hemoglobin
IMWG International Myeloma Working Group
IQR Interquartile Range
MAD Mean Absolute Deviation
MM Multiple Myeloma
MRI Magnetic Resonance Imaging
PCs Plasma Cells
PD Progressive Disease
PMNs Polymorphonuclear leukocytes
PR Partial Response
RF Radio Frequency
RMS Root Mean Squared
ROI Region of Interest
SD Stable Disease
SD Standard Deviation
SE Spin Echo
STIR Short-Time Inversion Recovery
Serum IgA Serum Immunoglobulin A
TE Echo Time
TI Inversion Time
TPR True Positive Rate
TN True Negative
TP True Positive
TR Repetition Time
VGPR Very Good Partial Response
VOI Volume of Interest
WBC White Blood Cell
rMAD Robust Mean Absolute Deviation

1 Introduction

1.1 Motivation

Chemotherapy is a debilitating procedure that patients with multiple myeloma commonly undergo to fight the cancer. A successful outcome is not certain, since it is impossible to predict exactly how the patient will respond. A non-invasive imaging biomarker that could aid in predicting the treatment’s outcome would indicate whether chemotherapy is the path to follow and, if not, other alternatives could be explored without subjecting the patient to this particularly exhausting procedure. If the predictions are accurate and can be made at an early stage of treatment, time and money could be saved and, most importantly, patient care could be improved by avoiding unnecessary treatment.

1.2 Topic Overview

According to the National Cancer Institute, cancer is a collection of diseases in which abnormal cells are able to grow indefinitely and may spread into nearby tissues. [1]

The total number of cancer deaths continues to rise due to the increasing population size, life expectancy and population mean age. [2] In 2016, cancer was the sixth leading cause of death in the world according to the World Health Organization. This is a consequence of the increasing number of cancer patients, as well as of the progress made against other deadly diseases, such as HIV/AIDS or tuberculosis. [3]

Multiple myeloma (MM) is characterized by the proliferation and accumulation of monoclonal plasma cells. [4] As defined by the Canadian Cancer Society, plasma cells are a type of white blood cell that secretes large volumes of antibodies. These cells are an important part of the immune system and can be found in the bone marrow. The abnormal plasma cells, also known as myeloma cells, can form tumours in the bones. A single tumour formed by myeloma cells is called a plasmacytoma; if several plasmacytomas are found, the condition is called multiple myeloma. [5] The abnormal plasma cells that characterize this cancer can be distributed in the bone marrow focally and/or diffusely. [6]

The overgrowth of plasma cells may lead to a low blood count, which may result in anemia (a shortage

of red blood cells), thrombocytopenia (a low level of platelets) and leukopenia (a shortage of normal white

blood cells). The main consequences of these conditions are weakness and fatigue, increased bleeding

and bruising and a frail immune response, respectively.

The existence of myeloma cells also interferes with bone tissue replacement. Osteoclasts and osteoblasts work together to produce healthy and strong bones. The myeloma cells induce the osteoclasts to dissolve the existing bone without a proper follow-up by the osteoblasts, which are responsible for the synthesis of new bone. This imbalance leads to the formation of weak bones and to an increase in the calcium concentration in the bloodstream (hypercalcemia). [4]

Finally, the uncontrolled growth of the abnormal plasma cells leads to the dominant production of a single immunoglobulin, the monoclonal immunoglobulin. This monoclonal immunoglobulin, commonly called M-protein, may be used to measure the tumour load. Due to the predominance of a single antibody and to the very inefficient production of others, the body is unable to protect itself against infections. [7] The kidneys are also commonly affected by the production of M-protein, leading to kidney damage or failure (renal impairment). [4]

This disease may develop from an asymptomatic premalignant stage (characterized by the existence of abnormal cells that increase the chance of evolving into cancer), to monoclonal gammopathy of undetermined significance (characterized by the presence of M-protein), then to smouldering multiple myeloma (characterized by the detection of abnormal cells in the bone marrow and abnormal protein in the blood and/or urine in asymptomatic patients), and finally to symptomatic multiple myeloma with end-organ damage, such as renal impairment, hypercalcemia, anaemia and bone disease, as mentioned above. [8]

Based on the International Myeloma Working Group (IMWG) diagnosis criteria reported in 2014, the

diagnosis for MM relies on the demonstration of bone marrow plasmacytosis and/or on the presence of

M-proteins in the serum or urine and/or the detection of end-organ damage, with particular attention to

bone lesions. [8]

Although multiple myeloma remains incurable, novel therapies based on drug combinations have prolonged survival. In the early 1960s, melphalan was introduced as an anti-cancer chemotherapy drug. This type of drug works by damaging the genetic material of the cell responsible for cellular division. [9] For patients responsive to melphalan, the treatment increased survival by more than 2 years. In the late 1980s, this treatment was combined with autologous bone marrow transplantation, which led to an overall survival greater than 3 years. Nowadays, a 3-drug combination is commonly used, consisting of an immunomodulatory drug, a proteasome inhibitor and a glucocorticoid, followed by autologous stem cell transplantation and maintenance therapy with a low dose of an immunomodulatory drug or of a proteasome inhibitor. The immunomodulatory drugs have pleiotropic anti-myeloma properties, including immune-modulating, anti-angiogenic, anti-inflammatory and anti-proliferative effects. The proteasome inhibitor, by blocking the action of the proteasome, prevents the degradation of pro-apoptotic factors. The glucocorticoid is part of the feedback mechanism in the immune system that limits exaggerated responses, so it can be used to reduce inflammation. Recently, immunotherapy has gained relevance in myeloma therapy. Nevertheless, the treatments eventually fail due to acquired drug resistance and clonal evolution. Genetic changes were detected between the time of diagnosis and relapse, which suggests the expansion of resistant clones present at low frequency at diagnosis or the appearance of mutations as the disease progresses. These findings imply that the best path to follow should be individualized treatment for maximum survival benefit. [10]

Magnetic resonance imaging (MRI) is a non-invasive method that can be adopted to evaluate the existence of alterations in bone marrow composition. It is used in the diagnostic work-up of patients with MM and to assess the development of the disease during and after treatment. [11]

The lesions caused by the disease have high cellularity and high water content, in contrast with the normal bone marrow composition; these lesions therefore present a different signal intensity from healthy bone marrow. Different types of MRI are used to form a better-informed opinion about staging and response to treatment, such as conventional spin echo (SE) and diffusion weighted imaging (DWI). Conventional MRI provides morphological information for the detection of focal lesions in patients with MM, while DWI provides additional functional information. Conventional MRI is used to assess infiltration patterns, and DWI evaluates the bone marrow composition and cellularity. [8]

The parametric maps that can be generated from magnetic resonance images have been explored in

order to find a reliable and precise biomarker. Some studies regarding the apparent diffusion coefficient

(ADC) maps have been conducted and they indicate the existence of a relation between the treatment’s

outcome and the evolution in signal intensity during treatment. [12] Furthermore, there are also some

studies that support the use of fat fraction (FF) maps as an excellent biomarker. [13] [14]

Five different patterns of bone marrow infiltration can be identified in multiple myeloma: a normal-appearing marrow, focal infiltration, diffuse disease, salt-and-pepper involvement, or combined focal and diffuse infiltration. The focal lesions are characterized by circumscribed areas with a signal intensity distinct from that of the surrounding vertebral bodies. The diffuse infiltration is characterized by a homogeneous shift in signal intensity from the normal appearance of the bone marrow. [6] The combined focal and diffuse infiltration pattern is characterized by a homogeneous shift in signal intensity with additional interspersed foci. The salt-and-pepper pattern is characterized by an inhomogeneous shift in signal intensity. [8] The signal intensity shift in damaged areas depends on the technique used. For instance, since the water content increases and the fat content decreases in the abnormal areas, the signal intensity decreases on T1-weighted images, while it increases on T2-weighted images with fat suppression. Both these techniques are explained in section 2.1.

The IMWG, the National Comprehensive Cancer Network and the European Society for Medical Oncology recommend a skeletal survey study for MM diagnosis. [15] [16] Nevertheless, the images provided by the University Hospital of Athens, Greece, are only spine surveys focusing on the lumbar region. Although a whole-body survey would be more precise, this region is thought to be representative of the alterations occurring in the bone marrow. Access was granted to T1-weighted, Short-Time Inversion Recovery (STIR), In and Opposed Phase Gradient Fast Field Echo (FFE) and DWI magnetic resonance images from patients undergoing chemotherapy as the induction treatment against multiple myeloma.

From the several MRI images, a high number of quantitative imaging features can be collected and converted into mineable high-dimensional data in order to improve decision support in precision medicine; this process is known as radiomics. [17] Precision medicine focuses on understanding individual variability in disease prevention, care and treatment. [18] The ultimate goal is to establish quantitative predictive or prognostic associations between the clinical images and the medical outcome. [19]

With the images associated with the problem under study available, the workflow in this radiomics project comprises the following stages:

1. Segment the regions of interest (ROI) in each patient images;

2. Extract the image features;

3. Perform univariate analysis.

Segmentation may be the most critical and challenging step in the radiomics workflow. The segmentation masks guide the subsequent step of feature extraction, which ultimately conditions the validity of the conclusions drawn from the study. Furthermore, there is a high degree of variability inherent to the semi-automated process used nowadays, derived from differences in user expertise and distinct image acquisition conditions. As a consequence, and to ensure reproducibility, at least two radiologists are needed for the segmentation step, which leads to increased costs. With all of this in consideration, there is a consensus that the development of a program capable of performing automated segmentation with minimal manual intervention is crucial to minimize inter-user variability, while increasing time efficiency, reproducibility and accuracy. [17] [20]

All the vertebral bodies present in the resonance images are considered ROIs. Figure 1.1 presents

an image of a patient’s spine in the sagittal plane before and after the segmentation process

using ITK-SNAP.

Figure 1.1: MRI image (in-phase gradient echo) from a patient’s spine. (A) Original image and (B) segmented image with the regions of interest (spine’s vertebrae) filled with a red label.

The MRI scans are composed of several images of the spine that cover different longitudes in the

sagittal plane. Each set of ROIs forms the volume of interest (VOI). The imaging features collected are

representative of the VOI.

In the present case, the extracted features refer only to first-order statistics, comprising

18 features that were subsequently subjected to a univariate analysis. The analysis was limited to this type

of features because parametric quantitative maps (ADC and FF) were used instead of the originally

acquired conventional images, and to this type of analysis because the small number of patients does not

allow a reliable multivariate analysis. The software chosen to analyze the extracted imaging

features was RapidMiner, with the aid of RStudio.

Radiomics is a fairly young field with great potential to accelerate precision medicine. The

goal of radiomics is to link the image features to phenotypes or molecular signatures. For that reason,

it is necessary to develop an integrated database in which the images and the extracted features are

associated with the clinical and molecular data, with deidentification ensuring that patient privacy is not jeopardized.

Building this database is still a challenge to be overcome.

1.3 Objectives

The objective of this work is the discovery of accurate imaging biomarkers that can aid in the prediction

of treatment response status in an early phase of induction chemotherapy in patients with multiple

myeloma.

Is it possible to predict treatment response using either the pre-treatment imaging data or the

post-1st-cycle data, in order to optimize therapeutic decisions?

1.4 Thesis Outline

All the techniques, parameters and algorithms that contribute to the development of the models attained

are described in the chapter Background (chapter 2). Within this chapter there are several sections

that are briefly outlined below.

Section 2.1: Image Acquisition. The MRI images were obtained using several techniques, such as

T1-weighted, STIR, In and Opposed phase gradient echo and DWI. The principles of each technique

are described in this section. Furthermore, it is explained how the FF and ADC parametric maps

are built from the In and Opposed phase gradient echo and the DWI resonance images, respectively.

Section 2.2: Pre-processing. The MRI images may need correction in order to increase image

quality. Although no pre-processing took place in this work, this section describes what led to that decision.

Section 2.3: Segmentation. The segmentation of the images was performed using the

ITK-SNAP platform. It allows a semi-automatic segmentation of the volumes of interest, which is further explained

in this section. All the vertebral bodies present in the images were segmented as volumes of interest.

Section 2.4: Feature Extraction. The classifier is based on the features extracted from the

segmented images. The feature extraction was carried out using the PyRadiomics software. All

the imaging features used are further explained in this section.

Section 2.5: Clinical Variables. Access was granted to the clinical data of the patients that took

part in this study. Evaluating these variables in addition to the imaging features was considered to be

of added value. This section describes the clinical variables that were additionally

explored.

Section 2.6: Statistical Analysis Concepts. This section presents the scenarios analyzed regarding

response to treatment and the different resources that were used during the implementation

process, which include the univariate approach followed in this work and the performance metrics that

aided the evaluation of the imaging and clinical features.

The MRI imaging techniques and the implementation of the statistical tools used throughout this work

are explored in the chapter Materials and Methods (chapter 3). This chapter is divided into three main

sections, briefly outlined below.

Section 3.1: Patient Population. Definition of the patient population considering age, gender and

treatment’s response.

Section 3.2: MRI Imaging Techniques. Specification of the technical parameters used during the

acquisition of the resonance images.

Section 3.3: Statistical Analysis Implementation. Firstly, this section presents the structure

and composition of the data sets with the imaging feature values (subsections 3.3.1 and 3.3.2).

Secondly, it explores how the features extracted from the ADC and FF maps were evaluated by means of

a univariate analysis focused on the Mann-Whitney U-Test and the construction of the ROC curve for each

attribute (subsection 3.3.3), which is the main purpose of this section. The data analysis is conducted

using the RapidMiner software, with the aid of RStudio to create the needed functions that were not available.

This software has several tools that allow a thorough analysis of the data.

The results achieved from the implementation of the univariate analysis are inspected in the

chapter Results (chapter 4). These results are focused solely on the univariate analysis due to the

reduced number of patients: the results produced by a multivariate analysis under these conditions

would not be reliable, as they would be prone to overfitting. The univariate analysis works as a proof

of concept. The attributes found most promising in the univariate analysis are graphically explored to

assess the separation between populations. Finally, a graphical appraisal is also conducted of the

evolution of the signal intensity during the first round of treatment in patients with different treatment

response.

The chapter Discussion (chapter 5) debates the results achieved in this work and compares

them with the literature.

The chapter Conclusion (chapter 6) discusses the applicability of the potential biomarkers

found, according to the results obtained. Is it plausible to develop the model with an increased number

of patients based on the preliminary results? What is the next step?

2 Theoretical Background

2.1 Image Acquisition

MRI is based on the magnetization properties of atomic nuclei. At first there is the alignment of the

protons usually randomly oriented within the molecules of the tissue under observation. This alignment

is perturbed by the introduction of an external radio frequency energy. After the perturbation, the nuclei

will go back to the equilibrium position releasing energy. The information gathered by the emitted signal

of the tissue under examination is then converted into an intensity level, which will be depicted in the re-

sulting image. Different types of images are obtained by the variation of the radio frequency (RF) pulses

properties, such as the repetition time (TR, time interval between successive pulse sequences applied

on the same slice) and the echo time (TE, time interval between the delivery of the radio frequency pulse

and the reception of the echo signal). [21]

The protons spin around the long axis of the primary magnetic field; this phenomenon is called

precession. The Larmor or precessional frequency in MRI refers to the rate of precession of the magnetic

moment of the proton around the external magnetic field. The frequency of precession is proportional

to the strength of the magnetic field. [22] Each substance will have a distinct Larmor frequency. Due

to this difference in resonance frequencies, the spins of different substances go in and out of phase

with each other as a function of time. When the protons precess together, in other words, when they

have overlapping magnetic moments, they are considered in-phase. When the protons do not precess

together, they are considered out-of-phase. In the particular case where the protons are 180° out-of-

phase, they are considered as in opposed-phase. The period of this phase cycling is equal to the

inverse of the frequency difference between spins. Each state, in- and opposed-phase occurs once per

cycle. [23]
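
As a worked illustration of the phase cycling described above, the following minimal sketch computes the first opposed-phase and in-phase times for two spin species. The field strength (1.5 T) and the fat-water chemical shift (about 3.5 ppm) are assumptions chosen for illustration only; they are not taken from the acquisition protocol of this work, and the function name is hypothetical.

```python
# Proton gyromagnetic ratio divided by 2*pi, in MHz per tesla.
GAMMA_BAR_MHZ_PER_T = 42.577


def phase_cycle_times_ms(field_t, shift_ppm):
    """Return (opposed_phase_ms, in_phase_ms) for the first phase cycle.

    field_t   : main magnetic field strength in tesla
    shift_ppm : chemical shift between the two spin species, in ppm
    """
    larmor_hz = GAMMA_BAR_MHZ_PER_T * 1e6 * field_t  # Larmor frequency (Hz)
    delta_f_hz = larmor_hz * shift_ppm * 1e-6        # frequency difference (Hz)
    period_ms = 1e3 / delta_f_hz                     # period = 1 / frequency difference
    # Opposed phase occurs half-way through the cycle, in phase at its end.
    return period_ms / 2.0, period_ms


# Assumed example: fat and water at 1.5 T with a 3.5 ppm shift.
op_ms, ip_ms = phase_cycle_times_ms(field_t=1.5, shift_ppm=3.5)
print(f"opposed-phase: {op_ms:.2f} ms, in-phase: {ip_ms:.2f} ms")
```

Under these assumptions the two states alternate every few milliseconds, which is why in-phase and opposed-phase images are acquired at two different echo times.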

Spin echo (SE) is a fundamental pulse sequence in MRI. The sequence is composed of an excitation

pulse and at least one refocusing pulse. Usually the flip angles of the excitation and refocusing pulses

are set to 90° and 180°, respectively. A refocusing RF pulse has the objective of refocusing the spins that

have dephased. The resulting images can be acquired with either a single-echo or a multiple-echo

pulse sequence. The difference lies in the number of RF refocusing pulses applied within each

TR interval after the initial excitation pulse. The main advantage of the SE pulse sequence is the possibility

of combining the TR and TE values in order to create specific contrast weighting, such as T1- and T2-weighted

images. [24]

2.1.1 T1-weighted

Different tissues can be characterized by two distinct relaxation times, T1 and T2. T1 is the longitudinal

relaxation time and it determines the rate at which protons return to equilibrium after excitation, in other

words, it is a measure of the time taken for spinning protons to realign with the magnetic field. [21] [25]

T2 is the transverse relaxation time and it determines the rate at which excited protons go out-of-phase

with each other, in other words, it is a measure of the time taken for spinning protons to lose phase

coherence in the transverse plane. [21][26]

T1-weighted images reflect the difference between T1 relaxation times of distinct tissues and they

are produced by short TR and TE. In this type of image, fat is bright and water is dark. Since

water accumulates at the lesion sites, these regions appear darker on the T1-weighted MRI

images.

2.1.2 Short-Time Inversion Recovery

STIR is a fat suppression technique. The sequence begins with a 180° pulse, which inverts

the longitudinal magnetization of all tissues. During the time interval that follows this first pulse, T1

relaxation occurs, seeking to restore the equilibrium alignment in the positive direction of the field. After a

selected time interval, a 90° pulse is applied, which tips the longitudinal magnetization into the transverse plane. If at the time of this

second pulse the longitudinal magnetization of a tissue is close to zero, the signal from that tissue will be equal or very

close to zero. The time interval between the two pulses is called the inversion time, and the value that nulls the signal differs

between tissues according to their T1. For fat suppression, the inversion time (TI) is given by Equation

2.1, for which the fat signal is equal to zero. [27]

$$\mathrm{TI} = \ln(2) \cdot T_{1,\mathrm{fat}} \tag{2.1}$$
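
Equation 2.1 can be checked numerically with a short sketch. It assumes an ideal inversion with full recovery before each 180° pulse, so that the longitudinal magnetization recovers as Mz(t) = M0 [1 - 2 exp(-t/T1)]; the T1 value used for fat (260 ms) is an illustrative assumption and not a parameter reported in this work.

```python
import math


def mz_after_inversion(t_ms, t1_ms, m0=1.0):
    """Longitudinal magnetization at time t after an ideal 180-degree pulse."""
    return m0 * (1.0 - 2.0 * math.exp(-t_ms / t1_ms))


def null_inversion_time(t1_ms):
    """Inversion time at which the signal of a tissue with the given T1 is nulled."""
    return math.log(2.0) * t1_ms


T1_FAT_MS = 260.0  # assumed illustrative T1 of fat, in milliseconds
ti_ms = null_inversion_time(T1_FAT_MS)

print(f"TI = {ti_ms:.1f} ms")
print(f"Mz(TI) = {mz_after_inversion(ti_ms, T1_FAT_MS):.2e}")  # numerically zero
```

Tissues with a longer T1 than fat still have negative longitudinal magnetization at this inversion time, so they keep contributing signal while fat is suppressed.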

2.1.3 In and Opposed Phase Gradient Fast Field Echo

In and Opposed phase gradient echo (GRE) is a type of dual echo sequence, which means that for a

pulse sequence two echoes are acquired. The fast field echo was the type of GRE pulse sequence used

to obtain the in- and opposed-phase MR images. The FFE is primarily used for anatomical imaging. [28]

Water and fat protons precess at fractionally different frequencies due to their different local chemical

environment. Unlike SE, GRE pulse sequences do not have an RF refocusing pulse, which means

that the water and fat protons will periodically be both in-phase (overlapping magnetic moments) and

opposed-phase (opposed magnetic moments). These two states correspond to different TE. In the first

situation the signal intensities add, leading to an increase of the signal intensity. In the second

situation, on the other hand, the signal intensities subtract, leading to a decrease of the signal intensity. The

main application of this sequence is the determination of the fat fraction within each individual voxel. [29]

The signal intensity of each voxel is determined in both in-phase (IP) and opposed-phase (OP)

images. The fat fraction (FF) from the vertebrae is then calculated according to Equation 2.2. [30]

$$\mathrm{FF} = \frac{IP - OP}{2 \cdot IP} \tag{2.2}$$
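
A minimal sketch of how Equation 2.2 could be applied voxel by voxel is given below. The nested-list representation of the images, the function names and the toy intensities are assumptions made for illustration; they do not describe the software actually used to build the FF maps in this work.

```python
def fat_fraction(ip, op):
    """Equation 2.2 for a single voxel; returns None where IP is zero."""
    if ip == 0:
        return None  # background voxel: the ratio is undefined
    return (ip - op) / (2.0 * ip)


def fat_fraction_map(ip_image, op_image):
    """Apply Equation 2.2 to two co-registered 2D images (lists of rows)."""
    return [
        [fat_fraction(ip, op) for ip, op in zip(ip_row, op_row)]
        for ip_row, op_row in zip(ip_image, op_image)
    ]


# Toy 2x2 slices with made-up signal intensities.
ip_slice = [[100.0, 80.0], [0.0, 50.0]]
op_slice = [[60.0, 80.0], [0.0, 10.0]]
ff_slice = fat_fraction_map(ip_slice, op_slice)
print(ff_slice)
```

Under the common two-component assumption IP = W + F and OP = W - F for a water-dominant voxel, Equation 2.2 reduces to F / (W + F), which is why the result is read as a fat fraction.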

2.1.4 Diffusion-Weighted Imaging

DWI is a technique very sensitive to cell density, relative content of fat and marrow cells, water content

and bone marrow perfusion. Thus, it is commonly used to measure bone marrow composition and

cellularity.

The signal intensity of DWI is dependent on the stochastic Brownian motion of water molecules within

a tissue at the microscopic level and on the diffusion gradient strength used. The factor that reflects the

strength and timing of the gradients used to generate these images is called b-value. This value is a

function of the strength, duration and time interval between two strong gradient pulses generated during

the DWI pulse sequence. The increment of any of these variables leads to an increase of the b-value.

[24] [31]

The objective behind the variation of the b-value during DWI image acquisition is the construction of

the apparent diffusion coefficient (ADC, mm²/s) map. The ADC is a direct indicator of water motion

within the extracellular and intracellular space, thus it can be directly related to tissue cell density. This

value can be calculated from the exponential decay of the signal intensity (S) as a function of the b-value

(b, s/mm²), as shown by Equation 2.3. [31][32][33]

$$S = S_0 \, e^{-b \cdot \mathrm{ADC}} \tag{2.3}$$

In this equation, S0 corresponds to the signal intensity when the b-value is equal to zero. [34]
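
One common way of estimating the ADC of a voxel from Equation 2.3 is a linear least-squares fit of ln(S) against b, since ln(S) = ln(S0) - b · ADC. The sketch below illustrates this under stated assumptions: the five b-values and the synthetic, noise-free signal are hypothetical, and this is not necessarily the fitting procedure used by the scanner software that produced the ADC maps of this work.

```python
import math


def fit_adc(b_values, signals):
    """Least-squares fit of ln(S) = ln(S0) - b * ADC; returns (adc, s0)."""
    logs = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_log = sum(logs) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_log) for b, y in zip(b_values, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_b
    return -slope, math.exp(intercept)


# Hypothetical b-values (s/mm^2) and a synthetic voxel without noise.
b_values = [0, 200, 400, 600, 800]
true_adc, true_s0 = 0.6e-3, 1000.0  # mm^2/s and arbitrary units (assumed)
signals = [true_s0 * math.exp(-b * true_adc) for b in b_values]

adc, s0 = fit_adc(b_values, signals)
print(f"ADC = {adc:.2e} mm^2/s, S0 = {s0:.1f}")
```

With real data the signal at high b-values approaches the noise floor, so voxels with zero or negative intensity would need to be masked before taking the logarithm.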

Lesion sites display higher signal intensity in DWI images, due to a low amount of fat cells and a

high retention of water molecules, as a consequence of a higher cell density that restricts the water

molecules’ diffusion.

2.2 Pre-processing

There was no need for pre-processing. For both the ADC maps and the In and Opposed phase gradient echo

images, five and two 3D volumes, respectively, are transformed through the relationship between

voxels at the same position.

Noise removal is a common procedure. Nevertheless, it could lead to the loss of information

contained in the images, for instance at higher b-values, where the signal intensity is lower. MR signal is

usually relative, with large differences between scanners and vendors. By normalizing the image before

feature calculation, this confounding effect may be reduced. However, if only one specific scanner

is used or if the images reflect an absolute value (e.g. ADC maps, T2 maps (not T2-weighted images)), the

normalization is optional.

2.3 Segmentation

ITK-SNAP was the tool used to perform the identification of the ROIs on the patients’ spines during

this work. This software allows the consecutive analysis of a set of images collected along the

spine’s sagittal plane. It provides a segmentation tool based on user input, which comprises a

presegmentation stage that roughly delimits regions based on signal intensity, followed by the identification

of the regions of interest within the delimited areas and, finally, a contour evolution that

attempts to fill the areas previously selected. All these steps require user input. To improve the

semi-automatic segmentation achieved, the images can be further worked on with a coloring tool. The extent

of this procedure is highly dependent on image quality.

Segmentation masks were constructed during this work by the author of this thesis for all four types

of MRI images (T1-weighted, STIR, In and Opposed phase gradient echo and DWI), covering all

visible vertebrae. Nevertheless, only two masks will be used in the feature extraction step: DWI and In

and Opposed phase gradient echo, as the objective is to develop a classifier based on the ADC and FF

maps.

2.4 Feature Extraction

In order to develop an imaging biomarker from the imaging features, the Python package PyRadiomics

was used. In the work developed during this thesis, first-order statistics features were

extracted from the ADC and FF parametric maps obtained from the original DWI and In and Opposed

phase dual FFE resonance images, respectively. These features describe the distribution of voxel intensity

within the mask region.

As defined by the PyRadiomics documentation, “First-order statistics describe the

distribution of voxel intensities within the image region defined by the mask through commonly used and

basic metrics.” [35]

The metrics used are enumerated and briefly described below. [35] Although 19 features

are presented, only 18 were evaluated in the subsequent work. The Variance and Standard Deviation

are highly correlated, therefore only the Variance will be considered.

In the equations displayed for some of the metrics, X stands for the set of Np voxels included in

the VOI and X̄ for its mean value; P(i) portrays the first-order histogram with Ng discrete intensity levels, where Ng is the number

of non-zero bins with a pre-defined bin width, and p(i) represents the normalized first-order histogram

[P(i)/Np]. The width is usually set in order to obtain a representative number of bins, between 30 and

130 bins, without compromising the bin width.

The array shift (c) is an optional parameter and it ensures that the voxels with the lowest gray values

contribute the least to the metric in question, instead of the voxels with a value closer to zero. This is

commonly used in normalized data.

1. “Energy is a measure of the magnitude of voxel values in an image.” It is the sum of each squared

voxel intensity, as shown in Equation 2.4, where c stands for an array shift.

$$\mathrm{energy} = \sum_{i=1}^{N_p} [X(i) + c]^2 \tag{2.4}$$

2. “Total Energy is the value of Energy feature scaled by the volume of the voxel in cubic mm.” It

reflects the energy affected by the voxel volume, as shown in Equation 2.5.

$$\mathrm{total\ energy} = V_{voxel} \sum_{i=1}^{N_p} [X(i) + c]^2 \tag{2.5}$$

3. “Entropy specifies the uncertainty/randomness in the image values. It measures the average

amount of information required to encode the image values.” This metric is translated in Equation

2.6, where ε is an arbitrarily small positive number (≈ 2.2 × 10⁻¹⁶).

$$\mathrm{entropy} = -\sum_{i=1}^{N_g} p(i) \log_2 [p(i) + \varepsilon] \tag{2.6}$$

4. Minimum is the lowest gray level intensity value found within the VOI (Equation 2.7).

$$\mathrm{minimum} = \min(X) \tag{2.7}$$

5. 10th Percentile represents the value below which 10% of the observations fall.

6. 90th Percentile represents the value below which 90% of the observations fall.

7. Maximum is the highest gray level intensity value found within the VOI (Equation 2.8).

$$\mathrm{maximum} = \max(X) \tag{2.8}$$

8. Mean expresses the average gray level intensity within the VOI (Equation 2.9).

$$\mathrm{mean} = \frac{1}{N_p} \sum_{i=1}^{N_p} X(i) \tag{2.9}$$

The mean is commonly represented by µ in statistics.

9. Median is the median gray level intensity within the VOI.

10. Interquartile Range (IQR) expresses the spread of the gray intensity values between the 25th

(P25) and 75th (P75) percentiles, as represented in Equation 2.10. It spans the 50% of the

gray intensity level values found in the middle of the distribution.

$$\mathrm{interquartile\ range} = P_{75} - P_{25} \tag{2.10}$$

11. Range represents the distance between the minimum and maximum gray intensity values, as

shown in Equation 2.11. It is the range of gray values found within the VOI.

$$\mathrm{range} = \max(X) - \min(X) \tag{2.11}$$

12. “Mean Absolute Deviation (MAD) is the mean distance of all intensity values from the Mean value

of the image array.” This metric is reflected in Equation 2.12.

$$\mathrm{MAD} = \frac{1}{N_p} \sum_{i=1}^{N_p} \left| X(i) - \bar{X} \right| \tag{2.12}$$

13. “Robust Mean Absolute Deviation (rMAD) is the mean distance of all intensity values from the

Mean value calculated on the subset of image array with gray levels in between, or equal to the

10th and 90th percentile.” This metric is reflected in Equation 2.13, where N10−90, X10−90(i) and

X̄10−90 are the number of voxels, the gray intensity level for each voxel and the mean intensity

value, respectively, between the 10th and 90th percentile.

$$\mathrm{rMAD} = \frac{1}{N_{10-90}} \sum_{i=1}^{N_{10-90}} \left| X_{10-90}(i) - \bar{X}_{10-90} \right| \tag{2.13}$$

14. “Root Mean Squared (RMS) is the square-root of the mean of all the squared intensity values.” As

well as the metric Energy, it is a measure of the magnitude of the image values and it is reflected

in Equation 2.14.

$$\mathrm{RMS} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) + c]^2} \tag{2.14}$$

15. “Standard Deviation (SD) measures the amount of variation or dispersion from the Mean value.”

It is by definition the square root of the variance and it is given by Equation 2.15.

$$\mathrm{SD} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^2} \tag{2.15}$$

The standard deviation is represented by δ in this work; in statistics it is more commonly represented by σ.

16. “Skewness measures the asymmetry of the distribution of values about the Mean value.” A negative

skew is related to a longer left tail, while a positive skew is related to a longer right tail.

In either case, the asymmetry is related to an uneven distribution of the values about the mean,

more concentrated to the right or to the left, respectively. Equation 2.16 translates this measure,

where δ represents the standard deviation and µ3 stands for the third central moment.

$$\mathrm{skewness} = \frac{\mu_3}{\delta^3} = \frac{\frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^3}{\left( \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^2} \right)^3} \tag{2.16}$$

If this variable takes values between -0.5 and 0.5, the distribution of the data is considered

fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the distribution

is considered moderately skewed. Finally, if the skewness is lower than -1 or greater than 1, the

distribution is considered highly skewed. [36]

17. Kurtosis is a measure of the curvature of the probability distribution of values in the image VOI.

“A higher kurtosis implies that the mass of the distribution is concentrated towards the tail(s) rather

than towards the mean. A lower kurtosis implies the reverse: that the mass of the distribution is

concentrated towards a spike near the Mean value.” Equation 2.17 translates this measure, where

δ represents the standard deviation and µ4 stands for the fourth central moment.

$$\mathrm{kurtosis} = \frac{\mu_4}{\delta^4} = \frac{\frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^4}{\left( \frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^2 \right)^2} \tag{2.17}$$

Peter Westfall argues that the kurtosis reflects the tailedness of the distribution, in other words, how

the values are stretched. The formula used above is referred to as kurtosis (Equation 2.17). Some

authors may use the “excess kurtosis”, which corresponds to kurtosis − 3. A kurtosis equal to 3

or an “excess kurtosis” equal to zero corresponds to a normally distributed population. Compared

to the normal distribution, if the kurtosis is lower than 3, then the tails are shorter and thinner, and if

the value is greater than 3, then the tails are longer and broader. [37]

18. “Variance is the mean of the squared distances of each intensity value from the Mean value. This

is a measure of the spread of the distribution about the mean.” It is by definition the square of the standard

deviation and it is given by Equation 2.18.

$$\mathrm{variance} = \frac{1}{N_p} \sum_{i=1}^{N_p} [X(i) - \bar{X}]^2 \tag{2.18}$$

19. “Uniformity is a measure of the sum of the squares of each intensity value. This is a measure of

the homogeneity of the image array, where a greater uniformity implies a greater homogeneity or

a smaller range of discrete intensity values.” This metric is reflected in Equation 2.19.

$$\mathrm{uniformity} = \sum_{i=1}^{N_g} p(i)^2 \tag{2.19}$$
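
The 19 metrics listed above can be sketched in plain Python to make the definitions concrete. This is an illustrative re-implementation that follows Equations 2.4 to 2.19 as written here; it is not the PyRadiomics code, and the function names, the linear-interpolation percentile and the fixed-width binning by floor(x / bin_width) are assumptions of this sketch.

```python
import math
from collections import Counter

EPS = 2.2e-16  # small constant used inside the logarithm of Equation 2.6


def percentile(sorted_x, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    rank = (len(sorted_x) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (rank - lo)


def mean_abs_dev(values):
    """Mean absolute distance of the values from their own mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)


def first_order_features(x, bin_width=1.0, voxel_volume=1.0, c=0.0):
    """First-order features of the voxel values x (non-empty, not constant)."""
    n = len(x)
    xs = sorted(x)
    mean = sum(x) / n
    # Central moments of order 2, 3 and 4 (Equations 2.16 to 2.18).
    m2, m3, m4 = (sum((v - mean) ** k for v in x) / n for k in (2, 3, 4))
    energy = sum((v + c) ** 2 for v in x)
    # Normalized histogram p(i) over the non-empty fixed-width bins.
    counts = Counter(math.floor(v / bin_width) for v in x)
    p = [count / n for count in counts.values()]
    p10, p90 = percentile(xs, 10), percentile(xs, 90)
    return {
        "energy": energy,
        "total_energy": voxel_volume * energy,
        "entropy": -sum(pi * math.log2(pi + EPS) for pi in p),
        "minimum": xs[0],
        "p10": p10,
        "p90": p90,
        "maximum": xs[-1],
        "mean": mean,
        "median": percentile(xs, 50),
        "iqr": percentile(xs, 75) - percentile(xs, 25),
        "range": xs[-1] - xs[0],
        "mad": mean_abs_dev(x),
        "robust_mad": mean_abs_dev([v for v in x if p10 <= v <= p90]),
        "rms": math.sqrt(energy / n),
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
        "variance": m2,
        "uniformity": sum(pi ** 2 for pi in p),
    }


# Toy VOI with five voxel values (made-up numbers).
features = first_order_features([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

A VOI with constant intensity has zero variance, so the skewness and kurtosis ratios are undefined and the sketch raises a ZeroDivisionError in that case.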

2.5 Clinical Variables

Although the main goal of this study is the discovery of accurate imaging biomarkers, the

available clinical data from the patients participating in this study were explored to understand if these variables could

also aid in the prediction of treatment response in multiple myeloma.

The 23 variables explored are briefly described below.

1. Age at Rx. Age of the patients at the time of the diagnosis.

2. Weight. Weight of the patients in the studies.

3. Height. Height of the patients in the studies.

4. Gender. Gender (male or female) of the patients in the studies.

5. Eastern Cooperative Oncology Group Performance Status (ECOGPS). It is a scale that has

the underlying objective of describing the patient’s level of functioning. The scale helps to define the

population of patients in a trial studying new treatment methods and it is a way to track alterations

in a patient’s level of functioning during treatment. [38]

6. Hemoglobin (Hb). It is present in red blood cells as the oxygen-carrying protein. [39]

7. White Blood Cell (WBC). Blood cell responsible for immune response. [39] Some white cells are

responsible for the production of antibodies.

8. Polymorphonuclear leukocytes (PMNs). These are white blood cells characterized by the presence

of granules in their cytoplasm. These granules contain enzymes with broad-based activity

that digest microorganisms and are released during innate immune response. [39]

9. Platelet. Irregular disc-shaped blood component that assists in blood clotting. [39]

10. Albumin. The main protein in human blood, which sustains a key role in regulating the blood

osmotic pressure. [39]

11. Creatinine. It is a chemical waste product resulting from the normal muscle metabolism. This

substance is usually excreted through the urine, therefore it can be an indicator of kidney function.

[40]

12. Urea. This substance is normally removed from the blood stream by the kidneys and excreted in

the urine. Excessive urea in the blood may indicate kidney damage. [39]

13. Calcium. Mineral mainly found in the bones, where it is stored. It is essential for healthy bones.

[39] The excess of calcium in the blood stream is called hypercalcemia and it is a common side

effect of multiple myeloma.

14. Alkaline Phosphatase. Enzyme that liberates phosphate under alkaline conditions. High levels

of this enzyme may be an indication of bone disease. [39]

15. Beta-2 Microglobulin (β2M ). This protein is a component of the major histocompatibility complex

class I molecules. It is used as a tumor marker. [41]

16. Lactate Dehydrogenase (LDH). It is an enzyme found in nearly all living cells (animals, plants,

and prokaryotes), which is responsible for the interconversion between the substrate lactate and

NAD+ to the substrate pyruvate and NADH. [42] It is used as a tumor marker. [39] An increased

amount of LDH indicates possible tissue damage. [43]

17. Actual Percentage. Percentage of marrow plasma cells in the bone marrow. Plasma cells are

white blood cells generated in the bone marrow, which secrete antibodies. The population of

plasma cells present in the marrow may be an indicator of disease progression and tumor load.

[44]

18. Serum Peak. Serum is the clear liquid that remains after the blood has clotted, i.e. plasma without

the clotting factors, and it contains no red cells, white cells or platelets. [39] The serum peak can measure an abnormal

amount of proteins present in the blood, such as the M-protein in MM.

19. Serum Immunoglobulin A (Serum IgA). Immunoglobulin A is an antibody. Abnormal levels of

Immunoglobulin A may aid the diagnosis of MM. [45]

20. Serum Kappa Free. Light chains are one of the two components of the antibodies. One of the two

types of light chains present in humans is denominated kappa (κ). The detection of monoclonal

free light chains is an indicator of monoclonal gammopathies. [46]

21. Serum Lambda Free. In addition to the kappa light chain, humans also have the lambda (λ) light

chain. As mentioned previously, the detection of monoclonal free light chains is an indicator of

monoclonal gammopathies. [46]

22. Ratio κ/λ. Ratio between the Kappa and Lambda free light chains.

23. International Staging System (ISS). The ISS predicts the severity of multiple myeloma based

on easily obtained protein concentrations, such as β2M and albumin concentration. The patient

is classified as having stage I, II or III MM. The increasing stage number is an indicator of the

disease’s progression. [47] [48]
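
Two of the variables above are derived from other measurements and can be sketched in a few lines. The ISS cut-offs used below (β2M below 3.5 mg/L with albumin of at least 3.5 g/dL for stage I, β2M of at least 5.5 mg/L for stage III, stage II otherwise) are those of the published ISS definition and are not stated in this text; the function names and units are assumptions of this sketch.

```python
def iss_stage(beta2m_mg_per_l, albumin_g_per_dl):
    """International Staging System stage (1, 2 or 3) from two serum values.

    Thresholds follow the published ISS definition (assumed here):
      stage 1: beta-2 microglobulin < 3.5 mg/L and albumin >= 3.5 g/dL
      stage 3: beta-2 microglobulin >= 5.5 mg/L
      stage 2: every other combination
    """
    if beta2m_mg_per_l >= 5.5:
        return 3
    if beta2m_mg_per_l < 3.5 and albumin_g_per_dl >= 3.5:
        return 1
    return 2


def kappa_lambda_ratio(kappa_free, lambda_free):
    """Ratio between the kappa and lambda free light chain concentrations."""
    return kappa_free / lambda_free


# Made-up example values, for illustration only.
print(iss_stage(beta2m_mg_per_l=4.2, albumin_g_per_dl=3.8))
print(kappa_lambda_ratio(kappa_free=30.0, lambda_free=12.0))
```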

2.6 Statistical Analysis Concepts

Data mining is a growing field and it makes use of several machine learning algorithms. Its objective

is to discover novel and useful patterns that might otherwise remain unknown in big data

sets, as well as to predict the outcome of a future observation. The data set used in this thesis is rather

small; nevertheless, the principles used are the same. [49]

The imaging features to be extracted from the MR images will be evaluated for their predictive value

in the classification of a patient as responsive or non-responsive to chemotherapy. Although only two

populations will be considered, responders and non-responders, the response to induction therapy is

classified on a scale ranging from 1 to 6, based on the IMWG response criteria [50]:

1. Stringent Complete Response

2. Complete Response (CR)

3. Very Good Partial Response (VGPR)

4. Partial Response (PR)

5. Stable Disease (SD)

6. Progressive Disease (PD)

There are two more categories that are acknowledged by the IMWG but they were not considered in

the response classification: immunophenotypic complete response and molecular complete response.

The IMWG criteria are given in appendix A.1 for informative purposes only.

The patients are distributed among two classes, responders and non-responders, considering their

final response to treatment. Three possible approaches are examined regarding the patients’ distribution

among the two classes. Two scenarios (scenarios 1 and 2) were created due to the lack of

consensus on whether a partial response to treatment should be considered an effective response to

treatment. A third situation (scenario 3) is considered, where the intention is to separate patients

that have a complete or very good partial response from the ones that only display a partial response to the

induction therapy. In this case, the patients that do not respond to treatment are excluded. This scenario

arises from the elevated number of patients that respond to treatment, but that may need treatment

adjustments according to the kind of response they present. All scenarios are described below.

• Scenario 1: the partial response (4) is treated as an effective response to treatment. Responders:

patients showing a minimum of partial response, classified as 1, 2, 3 and 4. Non-responders:

patients that display stable or progressive disease, classified as 5 and 6.

• Scenario 2: the partial response (4) is not treated as an effective response to treatment. Re-

sponders: patients showing complete or very good partial response, classified as 1, 2 and 3.

Non-responders: patients that show partial response or display stable or progressive disease,

classified as 4, 5 and 6.

• Scenario 3: only patients with a minimum of partial response are evaluated and the patients with

partial response are treated as a separate group. Responders: patients showing complete or

very good partial response, classified as 1, 2 and 3. Non-responders: patients that show partial

response to treatment, classified as 4.

These designations (scenario 1, scenario 2 and scenario 3) will be used throughout this work.

Appendix B.1 depicts the response after the induction therapy on the scale previously

presented (1 to 6) for all three scenarios.
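
The three scenarios amount to three different mappings from the response code (1 to 6) to a class label. A minimal sketch of that mapping is shown below; the dictionary layout, the label strings and the function name are illustrative choices and do not reproduce the RapidMiner or R implementation used in this work.

```python
RESPONDER = "responder"
NON_RESPONDER = "non-responder"

# Response codes: 1 sCR, 2 CR, 3 VGPR, 4 PR, 5 SD, 6 PD.
SCENARIOS = {
    1: {"responders": {1, 2, 3, 4}, "non_responders": {5, 6}},
    2: {"responders": {1, 2, 3}, "non_responders": {4, 5, 6}},
    3: {"responders": {1, 2, 3}, "non_responders": {4}},
}


def classify(response_code, scenario):
    """Return the class label, or None if the patient is excluded."""
    groups = SCENARIOS[scenario]
    if response_code in groups["responders"]:
        return RESPONDER
    if response_code in groups["non_responders"]:
        return NON_RESPONDER
    return None  # codes 5 and 6 are excluded in scenario 3


# A partial response (code 4) changes class between scenarios 1 and 2.
for scenario in (1, 2, 3):
    print(scenario, [classify(code, scenario) for code in range(1, 7)])
```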

2.6.1 Univariate Analysis

A univariate analysis consists of the review of the effect of one independent variable on a single dependent

variable (response or outcome variable). [51] [52]

Statistical Hypothesis Testing

In statistical hypothesis testing, two hypotheses regarding an unknown parameter from a

known distribution of a random variable of interest are confronted. The outcome of the test dictates whether a defined null

hypothesis is rejected when confronted with an alternative hypothesis. [53]

The decision to reject or not reject the null hypothesis is derived from a
statistical test based on the information collected from a sample. Nevertheless, two types of errors
may always occur: a type I error, in which the null hypothesis is rejected although it is true, and a type
II error, in which the null hypothesis is not rejected although it is false. The highest significance level (α) that
does not lead to the rejection of the null hypothesis is called the p-value. [53]

One can say that a test has statistical significance if the p-value is lower than the level of significance

defined in the study.

A widely used statistical test that may be applied to assess the separation of two populations is the
t-Test, which compares the means of two independent groups. [54] This parametric test assumes
that the two populations are normally distributed, which cannot be guaranteed here given the small sample size.
When their assumptions hold, parametric tests have greater statistical power, in other words, a higher probability
of correctly rejecting the null hypothesis, thus avoiding type II errors. [55] Therefore, it is generally agreed
that the adequate parametric test should be used to evaluate data if there is no reason to believe that
its assumptions are being violated. [56] Nevertheless, nonparametric tests have been reported as a
satisfactory alternative in biomedical sciences, especially for small samples. [54] Unlike parametric tests,
nonparametric tests do not make any specific assumptions regarding the population parameters that characterize
the underlying distribution, and are therefore the best suited for this study. [56] Furthermore,
these tests use the median as the location measure, instead of the mean, as it varies less
in skewed data and in the presence of outliers. [54]

The Mann-Whitney U-Test evaluates if two independent samples represent two populations with

different median values. [56] The null hypothesis states that both samples come from the same popula-

tion. If the Mann-Whitney U-Test is found statistically significant, one can conclude that there is a high

likelihood that the samples represent populations with different median values.
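As an illustration of the test described above, the following Python sketch computes the U statistic from mid-ranks and an exact two-sided p-value by enumerating every way of splitting the pooled observations into two groups. The function names are hypothetical, and the exact enumeration is an assumption made for clarity with small samples; the RapidMiner operator used in this work may compute the p-value differently (for instance, with a normal approximation).

```python
from itertools import combinations


def midranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic of sample x: rank sum of x minus its minimum possible value."""
    ranks = midranks(list(x) + list(y))
    n1 = len(x)
    return sum(ranks[:n1]) - n1 * (n1 + 1) / 2


def exact_two_sided_p(x, y):
    """Exact two-sided p-value under the null that both samples share one population.

    Every split of the pooled data into groups of the observed sizes is
    enumerated, so this is only practical for small samples.
    """
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    n1, n = len(x), len(pooled)
    center = n1 * (n - n1) / 2  # expected U under the null hypothesis

    def u_of(indices):
        return sum(ranks[i] for i in indices) - n1 * (n1 + 1) / 2

    observed = abs(u_of(range(n1)) - center)
    hits = total = 0
    for indices in combinations(range(n), n1):
        total += 1
        if abs(u_of(indices) - center) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With two fully separated samples of three observations each, such as `[1, 2, 3]` and `[4, 5, 6]`, only 2 of the 20 possible splits are as extreme as the observed one, so the smallest attainable two-sided p-value is 0.1. This illustrates why small samples limit the significance that a rank test can reach.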

Multiple Comparison Correction

When a large number of statistical tests is performed, some p-values may fall
below the defined critical value purely by chance, leading to the wrong rejection of the null hypothesis
(type I error). In order to limit the number of false positives, corrections for multiple comparisons may be applied. [57]
The Bonferroni correction, the Holm method and the Benjamini-Hochberg procedure are three possible multiple
comparison methods, and they are further explored in this thesis.

In the Bonferroni correction, a lower critical value is used instead of the usual critical significance level
(commonly 0.05). It is estimated by dividing the usual critical significance level (α) by
the number of tests (m). A test is considered statistically relevant if the associated p-value (Pi) is lower
than the new critical significance level, as shown in Equation 2.20.

$$P_i < \frac{\alpha}{m} \qquad (2.20)$$

This correction controls the family-wise error rate (FWER), which is the probability of making at
least one type I error across all the tests performed.

This correction is mainly useful for a small number of multiple comparisons and when it is expected
that just a couple of attributes are meaningful. In big data sets there is the risk of estimating an
unreasonably small critical value, leading to a high rate of false negatives. [57] For this reason, this approach is
often considered very conservative.

Equivalently, the p-value obtained for each attribute may be adjusted and compared to the
usual critical significance level, instead of lowering the latter. This follows from rearranging Equation
2.20: the p-value of each attribute is multiplied by the number of tests to obtain an adjusted
p-value, which is capped at 1.

Each test is considered statistically relevant if the associated adjusted p-value is lower than the
standard critical significance level.
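The Bonferroni adjustment described above can be sketched in a few lines of Python. The function name `bonferroni_adjust` is hypothetical; the sketch only illustrates the rule (multiply by the number of tests, cap at 1) and is not the R script used in this work.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values: each p-value times m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

An attribute is then declared statistically relevant when its adjusted p-value is below the usual significance level, for example 0.05.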

The Holm method is considered to be less conservative than the Bonferroni correction, while still
controlling the FWER. In this approach, the hypotheses are ordered from the smallest p-value
to the greatest and ranked, where the hypothesis with the smallest p-value has a rank of k = 1. The
adjusted significance level for each hypothesis is obtained by dividing the desired significance level (α) by the number
of hypotheses that may still be true, which corresponds to the total number of hypotheses (m) minus the number of hypotheses already
sequentially rejected (k − 1), that is, m + 1 − k. A hypothesis is rejected if the unadjusted p-value (Pk) is lower than
the adjusted significance level, as shown in Equation 2.21. The hypotheses are tested in rank order, and once one
hypothesis is not rejected, none of the remaining ones is rejected either. [58]

$$P_k < \frac{\alpha}{m + 1 - k} \qquad (2.21)$$

The variable k corresponds to the rank position of the hypothesis being tested. Regarding the first

ranked hypothesis, the adjusted significance level is given similarly to the Bonferroni correction and is

equal to α/m.

In the same fashion as the previous method (Bonferroni correction), instead of altering the significance
level, an adjusted p-value can be determined for each test by multiplying the original p-value by the
denominator in Equation 2.21 (m + 1 − k). The adjusted p-value is then compared to the usual
significance level. The test is considered statistically significant if the adjusted p-value is lower than the
desired significance level.
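A minimal Python sketch of the Holm adjusted p-values follows. The function name `holm_adjust` is hypothetical. Besides the multiplication by (m + 1 − k) described above, the sketch keeps a running maximum and caps the values at 1, which is the usual way of encoding the sequential nature of the method: an adjusted p-value can never be smaller than that of a hypothesis ranked before it.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order of the input."""
    m = len(p_values)
    # Indices of the p-values sorted from the smallest to the greatest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        candidate = min(1.0, (m + 1 - rank) * p_values[idx])
        # Enforce monotonicity: later ranks cannot get a smaller adjusted value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 and 0.005, the ordered multipliers are 4, 3, 2 and 1, and the fourth-ranked value (0.04 × 1) is raised to 0.06 by the running maximum.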

The Benjamini-Hochberg (BH) procedure controls the false discovery rate (FDR) by altering the
p-value determined by the test for each attribute. The false discovery rate is the expected proportion
of type I errors among the rejected null hypotheses. [59]

Using this approach, the attributes are initially arranged in increasing order of p-value and ranked, where the
smallest p-value has a rank of k = 1. Then, each individual p-value is compared to the corresponding
Benjamini-Hochberg critical value. These values are determined as shown in Equation 2.22.

$$\text{BH critical value} = \frac{k}{m} \cdot Q \qquad (2.22)$$

where k is the rank of the variable, m is the number of tests and Q is the defined false discovery rate.

Each attribute will have a corresponding BH critical value calculated. The highest ranked attribute
(the attribute with the highest value of k) that displays a p-value under its respective BH critical value
sets the threshold beyond which the attributes stop being considered statistically relevant. In
other words, all the attributes with a lower rank k should be considered statistically relevant
even if their individual p-value is above their calculated BH critical value.

An adjusted BH p-value can also be calculated. It is either the raw p-value multiplied by m/k
or the adjusted p-value of the next higher raw p-value, whichever is smaller. When the adjusted
p-value is smaller than the established false discovery rate, the test is considered significant. [57]
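The rule for the adjusted BH p-value can be sketched in Python as follows. The function name `bh_adjust` is hypothetical. The p-values are visited from the largest to the smallest so that "the adjusted p-value of the next higher raw p-value" is always available as a running minimum.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from the greatest to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank k of this p-value in increasing order
        candidate = p_values[idx] * m / rank
        # Take the smaller of p * m / k and the adjusted value of the next higher p.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 and 0.005, the values of p · m/k in increasing rank order are 0.02, 0.02, 0.04 and 0.04, so every test is significant at a false discovery rate of 0.05, whereas the Bonferroni correction would keep only two of them.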

In this work, all the individual tests are treated as independent of
each other when the Bonferroni, Holm and BH corrections are applied.

Unlike univariate analysis, multivariate statistics allows the analysis of several independent
and/or dependent variables at once, taking into consideration the correlation among
dependent variables. [51][52] This type of analysis will not be performed given the small sample size, as
mentioned previously.

2.6.2 Performance Metrics

The performance metrics reflect how good a classifier is and allow the comparison among classifiers.

One cannot say that a performance metric is better to evaluate a classifier when compared to all the

others. A combination of several measures may be used to investigate a classifier performance. [60]

Confusion Matrix

A confusion matrix enables the performance evaluation of a classifier on a given labeled data set. This

performance measure relies on the combination between predicted and actual values.

In a binary classifier (where the outcome is considered either positive or negative), a confusion matrix

can be generally represented as demonstrated by Table 2.1, where the rows correspond to the predicted

values, while the columns represent the actual values.

Table 2.1: General form of a confusion matrix.

|                    | Actual Positive | Actual Negative |
|--------------------|-----------------|-----------------|
| Predicted Positive | True Positive   | False Positive  |
| Predicted Negative | False Negative  | True Negative   |

Each combination of predicted and actual class value receives a different designation. A true
positive (TP) occurs when an instance is correctly predicted as positive. A false positive (FP) occurs
when an instance is mistakenly predicted as positive. A false negative (FN) occurs when an instance is
mistakenly predicted as negative. A true negative (TN) occurs when an instance is correctly predicted
as negative.

From a confusion matrix several performance metrics can be calculated.

• Accuracy translates the frequency of correct classifications (Equation 2.23).

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (2.23)$$

• Precision, also known as positive predictive value, represents the fraction of positives correctly
predicted over the total of instances predicted as positive (Equation 2.24).

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2.24)$$

• Sensitivity, also known as recall or true positive rate (TPR), translates the fraction of positives

correctly predicted over the total of instances with real positive value (Equation 2.25).

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.25)$$

• Specificity represents the fraction of negatives correctly predicted over the total of instances with

real negative value (Equation 2.26).

$$\text{specificity} = \frac{TN}{TN + FP} \qquad (2.26)$$

Complementary to specificity is the false positive rate (FPR), which
is the fraction of actual negatives wrongly predicted as positive (Equation 2.27).

$$\text{FPR} = 1 - \text{specificity} = \frac{FP}{TN + FP} \qquad (2.27)$$

• F1-measure summarizes the precision (p) and recall (r) performance measures, since it represents

a harmonic mean between these two (Equation 2.28). [49]

$$F = \frac{2 \cdot r \cdot p}{r + p} \qquad (2.28)$$

The metrics described above will be later used during the univariate analysis.
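The metrics of Equations 2.23 to 2.28 can be computed directly from the four cells of a confusion matrix, as in the Python sketch below. The function name `classification_metrics` and the dictionary keys are hypothetical; undefined ratios (zero denominator) are returned as NaN, which is a choice made for this illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics of Equations 2.23 to 2.28 from confusion matrix counts."""

    def ratio(numerator, denominator):
        # A ratio with a zero denominator is undefined and reported as NaN.
        return numerator / denominator if denominator else float("nan")

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": ratio(fp, tn + fp),
        "f1": ratio(2 * sensitivity * precision, sensitivity + precision),
    }
```

For instance, with TP = 8, FP = 2, FN = 4 and TN = 6, the accuracy is 14/20 = 0.70, the precision is 8/10 = 0.80, the specificity is 6/8 = 0.75 and the FPR is 0.25.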

ROC Curve

A receiver operating characteristic (ROC) curve is a performance measurement for classification problems at several threshold settings.
The thresholds are defined according to the values taken by each variable.

The ROC curve plots the TPR against the FPR. For a certain attribute, at each defined threshold, all the
observations above that threshold are predicted as belonging to the positive class, while the remaining
observations are considered as belonging to the negative class. The predicted class values
are then compared to the actual class values, leading to the calculation of the true positive rate and of the
false positive rate. Each threshold corresponds to a point (FPR, TPR) in the plot. When all the
thresholds for a certain attribute have been evaluated, its ROC curve is traced. The curve of a good classifier
lies as close as possible to the upper left corner of the diagram, while that of a random classifier lies
along the diagonal. [49] [61]

The area under the curve (AUC) translates the separability of the two classes. The AUC takes
values between 0 and 1. The higher the AUC, the better the capacity of the classifier to distinguish
the two classes. When the AUC is equal to 0.5, the classifier has no discriminating power and
its decisions are made with the same certainty as a coin toss. [49] [61] An AUC between 0.5 and 0.7 is
considered poor, between 0.7 and 0.8 good, between 0.8 and 0.9
very good, and over 0.9 excellent.
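The construction of a ROC curve and of its AUC can be sketched in Python as follows. The function names are hypothetical. Two conventions are assumptions of this sketch: each observed value is used as a threshold, and an observation is predicted positive when its value is greater than or equal to the threshold. Labels are coded as 1 (positive) and 0 (negative), and the area is obtained with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a curve given as (x, y) points ordered by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For the values 0.1, 0.4, 0.35 and 0.8 with labels 0, 0, 1 and 1, three of the four positive-negative pairs are correctly ordered, and the trapezoidal area is accordingly 0.75.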

The threshold applied by the classifier should offer the best possible combination of
sensitivity and specificity. The compromise between both is always problem dependent,
since a test with high sensitivity usually has low specificity. Sensitivity is favored over specificity if it
is more important to correctly identify a positive outcome than a negative one, at the cost of a
higher number of false positives. Conversely, specificity is favored over sensitivity
if it is more important to correctly identify a negative outcome than a positive one, at the cost of
more missed positive cases.

The Problem of Class Imbalance

Accuracy is a performance metric commonly used to analyze the performance of a classifier.
Nevertheless, it may not be suitable when the data set has imbalanced class distributions, which is a regular
occurrence in real applications. The accuracy of a classifier can be extremely high in the presence of
rare events. For instance, let us assume that only 1% of the vaccines produced by a certain company
are defective. If the classifier predicted that all vaccines are good, it would have an accuracy of 99%,
yet it would fail to detect any of the defective vaccines, which is the rare event that matters. [49]
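The vaccine example can be reproduced numerically. The Python sketch below assumes a hypothetical batch of 1000 vaccines, 10 of which are defective (the positive class), and a classifier that labels every vaccine as good.

```python
# Hypothetical batch: 1 = defective (rare positive class), 0 = good.
actual = [1] * 10 + [0] * 990
predicted = [0] * 1000  # the classifier predicts that every vaccine is good

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = (tp + tn) / len(actual)  # 990 / 1000
recall = tp / (tp + fn)             # 0 / 10
```

The accuracy is 0.99 while the recall is 0, since none of the defective vaccines is detected. Precision is undefined here because no instance is predicted as positive.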

For binary classification other performance metrics can be used, such as precision, recall and the

ROC curve. Usually, the rare class is denoted as the positive class, while the majority class is denoted

as the negative class. [49]

Both precision and recall focus on the positive class. The former gives the fraction of true positives among
the instances predicted as positive, while the latter gives the fraction of true positives among all the
actual positive instances. The F1-measure is often optimized when the positive class is considered more
interesting than the negative class, since it leads to an optimal compromise between precision
and recall. [49]

The ROC curve is a graphical approach that depicts the trade-off between the TPR and the FPR, as
previously explained. This metric is appropriate to compare the relative performance of different
classifiers, in particular when combined with the AUC, which indicates which classifier is better on
average. [49]

3 Materials and Methods

3.1 Patient Population

The group of patients recruited for this study is composed of 30 people, 15 women and 15 men, with
an age range from 37 to 79 years, a mean age of 63 years and a median age of 68 years. The classification
of each patient’s final response to treatment was made according to the IMWG response criteria: 1 patient
showed stringent complete response, 0 showed complete response, 11 displayed very good partial
response, 12 presented partial response, 4 exhibited stable disease and 2 were classified as having
progressive disease. [50]

3.2 MRI Techniques

Patients underwent MRI before chemotherapy and after they had completed one cycle of chemotherapy,
with the exception of 4 patients (ID: 27 to 30), whose magnetic resonance images were acquired
before and after all the chemotherapy cycles. All the MRI examinations were conducted in the University
Hospital of Athens, Greece. The following pulse sequences were performed to obtain the magnetic resonance images:

• T1-weighted sagittal lumbar spine: repetition time: 400 msec, echo time: 7.4 msec, section thickness: 4.0 mm,
section gap: 0.8 mm, image matrix: 246 x 512, number of signals acquired: 4, field of view: 300 x 300 mm,
acquisition time: 138 sec.

• Short TI inversion recovery sagittal lumbar spine: repetition time: 2500 msec, echo time: 60 msec,
section thickness: 4.0 mm, section gap: 0.8 mm, image matrix: 214 x 512, number of signals acquired: 4,
field of view: 300 x 300 mm, acquisition time for each sequence: 180 sec.

• Dual gradient-echo in- and opposed-phase lumbar spine: repetition time: 300 msec, echo time for the opposed
and in phase, respectively: 2.3/4.6 msec, section thickness: 4.0 mm, section gap: 0.8 mm, flip angle: 25°,
image matrix: 122 x 512, number of signals acquired: 4, field of view: 300 x 300 mm,
acquisition time for each sequence: 116 sec.

• DWI (steady-state echo-planar lumbar spine): repetition time: 2000 msec, echo time: 75 msec,
section thickness: 5.0 mm, section gap: 1.0 mm, image matrix: 152 x 256, number of signals acquired: 8,
field of view: 300 x 300 mm, acquisition time: 308 sec.

The b-values used for the DWI pulse sequence were 0, 150, 25, 500 and 750 sec/mm². [62]

All examinations were performed on a 1.5-T unit (Philips Healthcare, Best, the Netherlands) with a surface phased-array coil.

3.3 Statistical Analysis Implementation

3.3.1 Data sets

The MRI images were acquired before and after the first round of chemotherapy. For this reason,
three distinct situations may be considered for data analysis: pre-treatment (the imaging
features extracted from the MRI images acquired before treatment), post-1stcycle (the imaging
features extracted from the MRI images acquired during treatment, after the first round of
chemotherapy) and delta (the difference between the post-1stcycle and pre-treatment
moments, which represents the evolution of the signal intensity over the first round of chemotherapy). For
each parametric map (ADC and FF), three data sets were organized, each one considering a different
situation (pre-treatment, post-1stcycle or delta).

A seventh data set was also considered, to further exploit the data available.
It comprises the clinical data from all the patients involved in this study. Some of the features
evaluated were age, height, gender, weight and protein levels, all of which are explained in Section
2.5.

As mentioned previously in the section 2.6, three situations were considered in terms of response

classification (scenario 1, 2 and 3). The labelled data sets will have different class proportions.

When the partial response to the treatment is considered an effective response to treatment (scenario

1), the resulting data set recognizes approximately 20% of the patients as non-responsive and 80% as

responsive to treatment. On the other hand, if the partial response to treatment is not considered an

effective response to treatment (scenario 2), the resulting data set recognizes approximately 60% of the

patients as non-responsive and 40% as responsive. In scenario 1, one obtains a more unbalanced data

set when compared with scenario 2. When the patients with stable or progressive disease are excluded

from the study (scenario 3), approximately half of the patients are considered as responsive, while the

other half is considered non-responsive to treatment.
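These proportions can be checked against the response counts reported in Section 3.1 for the full cohort of 30 patients (1 stringent complete, 0 complete, 11 very good partial, 12 partial, 4 stable and 2 progressive). The worked arithmetic below assumes that no patient has been removed; the individual data sets may deviate slightly once patients with missing values are excluded.

```latex
\begin{align*}
\text{Scenario 1:}\quad & \frac{1 + 0 + 11 + 12}{30} = \frac{24}{30} = 80\%\ \text{responders},
  & \frac{4 + 2}{30} &= \frac{6}{30} = 20\%\ \text{non-responders} \\
\text{Scenario 2:}\quad & \frac{1 + 0 + 11}{30} = \frac{12}{30} = 40\%\ \text{responders},
  & \frac{12 + 4 + 2}{30} &= \frac{18}{30} = 60\%\ \text{non-responders} \\
\text{Scenario 3:}\quad & \frac{1 + 0 + 11}{24} = \frac{12}{24} = 50\%\ \text{responders},
  & \frac{12}{24} &= 50\%\ \text{non-responders}
\end{align*}
```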

3.3.2 Data Preparation

Each data set was individually inspected, and the patients with missing imaging feature values were
removed from the data set under examination. Furthermore, patients 27 to 30 underwent the second MRI
scan not after the first round of chemotherapy but at the end of the complete chemotherapy treatment.
For this reason, these patients are not present in the post-1stcycle and delta data sets.
Consequently, the data sets do not necessarily contain an
equal number of patients, ranging from the original 30 patients down to 23. In the particular case of
scenario 3, some data sets only consider 18 patients, since the patients with stable or progressive disease
are removed.

3.3.3 Univariate Analysis

p-value Evaluation

The Mann-Whitney U-Test is applied using the operator Mann-Whitney U-Test, available in RapidMiner,
which computes the p-value associated with the null hypothesis that the two samples
(responders and non-responders) come from the same population, by comparing their medians.

Different multiple comparison corrections were applied: Bonferroni, Holm and Benjamini-Hochberg.

These methods are successively less conservative and it is expected that in the less conservative cor-

rections more attributes are indicated as statistically significant.

It was defined that if the adjusted p-value of a certain attribute is lower than the
standard significance level of 0.05, one can be confident in rejecting the null hypothesis of the
Mann-Whitney U-Test, which states that the two populations have the same median. In other words, by
rejecting the null hypothesis, the two populations are considered well separated by the attribute tested.

In order to obtain the adjusted p-values for each of the multiple comparison corrections considered,
a small script was written in R and applied in the process developed with the software RapidMiner.
The script takes as input the data generated by the Mann-Whitney U-Test and
applies the R function p.adjust, which generates the adjusted p-value of each attribute
for the multiple comparison corrections previously described. The R script for
the calculation of the adjusted p-values is available in Appendix Subsection C.1.1. The number of
comparisons is automatically set to the number of attributes present in each data set. Since the imaging
feature p-values from the ADC and FF maps are corrected for multiple comparisons together in each
condition (pre-treatment, post-1stcycle and delta) and in each possible scenario (1, 2 and 3), the number
of comparisons is equal to 36.
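As a worked illustration of what 36 comparisons imply, the arithmetic below assumes 18 first-order attributes per parametric map (the number of attributes listed per map in the results tables) and the significance level of 0.05 used in this work. It shows the raw p-value an attribute needs in order to survive the Bonferroni correction, and the range covered by the Holm and Benjamini-Hochberg thresholds.

```latex
\begin{align*}
m &= 2 \text{ maps} \times 18 \text{ attributes} = 36 \\
\text{Bonferroni:}\quad P_i &< \frac{\alpha}{m} = \frac{0.05}{36} \approx 0.0014 \\
\text{Holm:}\quad P_k &< \frac{\alpha}{m + 1 - k}, \quad \text{from } \frac{0.05}{36} \approx 0.0014 \ (k = 1) \text{ to } \frac{0.05}{1} = 0.05 \ (k = 36) \\
\text{BH:}\quad P_k &< \frac{k}{m} \cdot Q, \quad \text{from } \frac{1}{36} \cdot 0.05 \approx 0.0014 \ (k = 1) \text{ to } \frac{36}{36} \cdot 0.05 = 0.05 \ (k = 36)
\end{align*}
```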

The general process developed for p-value collection in the univariate analysis is depicted in Figure

3.1.

ROC Curve Evaluation

The AUC performance measure was also used to evaluate how well each attribute performs on the

binary classification.

The code used to obtain the AUC is displayed in Appendix Subsection C.1.2. It was written
in R, using the package pROC, to be applied in a recursive fashion.

When tracing the ROC curve, the function roc available in the package pROC allows the definition

of the direction in which the thresholds are evaluated: in ascending or descending order of value. The

function automatically compares the medians from both groups. If the median of the positive class is

higher than the median from the negative class, the thresholds are evaluated in descending order (from

the higher to the lower threshold value). On the other hand, if the median of the negative class is higher

than the median from the positive class, the thresholds are evaluated in ascending order (from the lower

to the higher threshold value).
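The median-based choice of direction described above can be illustrated with a short Python sketch. It does not call pROC and does not reproduce its interface: the function names and the direction labels are hypothetical, and treating equal medians as "higher is positive" is an assumption of this sketch. The AUC is computed as the proportion of positive-negative pairs ordered consistently with the chosen direction, with ties counted as one half, which is equivalent to the area under the ROC curve.

```python
from statistics import median


def choose_direction(positive_values, negative_values):
    """Pick the direction of the comparison from the medians of both classes."""
    if median(positive_values) >= median(negative_values):
        return "higher_is_positive"
    return "lower_is_positive"


def auc_with_direction(positive_values, negative_values):
    """Return (AUC, direction), the AUC being the share of correctly ordered pairs."""
    direction = choose_direction(positive_values, negative_values)
    wins = 0.0
    for p in positive_values:
        for n in negative_values:
            if p == n:
                wins += 0.5  # ties contribute one half
            elif (p > n) == (direction == "higher_is_positive"):
                wins += 1.0
    return wins / (len(positive_values) * len(negative_values)), direction
```

With this convention, an attribute whose values are systematically lower in the positive class still obtains a high AUC, because the thresholds are evaluated in the direction that matches the data.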

Figure 3.1: General process used for the collection of the p-values in the univariate analysis.

In the construction of the ROC curves, the positive class is the minority class. In scenario 1 and in the
FF map data sets from scenario 3, the positive class is the non-responders. In this case, and
considering the problem at hand, one wants to minimize the FPR, even if it means having a worse TPR.
The reason underlying this decision is the priority given to continuing the treatment of people who
are responding to it. By minimizing the FPR, the specificity is maximized, which directly translates
into the maximization of the number of patients correctly predicted as true negatives, in other words,
true responders to treatment. If the responders are the positive class (scenario 2 and ADC map data
sets from scenario 3), then one wants the opposite situation, which means maximizing the TPR,
even at the cost of a worse FPR.

The ROC curve of each attribute with an AUC over 0.70 was further analyzed. For each data set,
the attributes with the best sensitivity and specificity were selected. In scenario 1 and in the
FF map data sets from scenario 3, only TPR values over 0.6 and FPR values
under 0.2 were considered, whereas in scenario 2 and in the ADC map data sets from scenario 3, only TPR values
over 0.8 and FPR values under 0.4 were considered. Afterwards, the attributes
from this filtered subset were compared among themselves with the intent of finding which attributes
would be the best possible biomarkers.
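The filtering step described above can be sketched in Python as follows. The function name `admissible_points` and the example operating points are hypothetical; the limits are the ones stated above (TPR over 0.6 and FPR under 0.2, or TPR over 0.8 and FPR under 0.4, depending on the scenario and map).

```python
def admissible_points(points, min_tpr, max_fpr):
    """Keep the (FPR, TPR) operating points with TPR above min_tpr and FPR below max_fpr."""
    return [(fpr, tpr) for fpr, tpr in points if tpr > min_tpr and fpr < max_fpr]


# Hypothetical operating points of one attribute, as (FPR, TPR) pairs.
example_points = [(0.0, 0.5), (0.1, 0.7), (0.3, 0.9), (0.5, 1.0)]

# Non-responders as positive class: favor a low FPR.
low_fpr_candidates = admissible_points(example_points, min_tpr=0.6, max_fpr=0.2)

# Responders as positive class: favor a high TPR.
high_tpr_candidates = admissible_points(example_points, min_tpr=0.8, max_fpr=0.4)
```

An attribute with no admissible operating point under the limits of its scenario would be left out of the filtered subset.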

All attributes are considered to be independent in a univariate analysis and they are analyzed indi-

vidually when evaluating the ROC curve.


4 Results

4.1 Univariate Analysis

Using univariate analysis, it is possible to develop an idea of how the attributes behave
independently, revealing which ones would be more interesting to explore further. These
considerations could then be carried over to a multivariate analysis.

4.1.1 p-value and ROC Curve Evaluation for the First Order Imaging Features

This section analyzes the results depicted in Tables 4.1 to 4.9, which display
all the original and adjusted p-values and the AUC (retrieved from the ROC curve) for each test of each
study. As mentioned previously, a significance level of 0.05 was chosen. All the p-values that fall under
this threshold are colored blue. The AUC values are also presented in these tables, and the ones equal
to or over 0.70 are highlighted in orange. This threshold was chosen to guarantee that the attributes further
evaluated have an AUC that is considered at worst good.

In the pre-treatment data set in scenario 1, 4 attributes show statistical significance
regarding the ADC map (interquartile range, kurtosis, mean absolute deviation and minimum) and
14 attributes show statistical significance concerning the FF map (90 percentile, energy, entropy,
interquartile range, maximum, mean, mean absolute deviation, median, range, robust mean absolute
deviation, root mean squared, skewness, total energy and uniformity) after the Mann-Whitney U-Test.
Although none of the attributes survive the more conservative multiple comparison corrections (Bonferroni
correction and Holm’s method), 10 attributes survive the Benjamini-Hochberg procedure:
the attribute kurtosis regarding the ADC map and the attributes 90 percentile, energy, entropy, mean,
median, RMS, skewness, total energy and uniformity concerning the FF map. A considerable number of
attributes display an AUC over 0.70: 8 attributes regarding the ADC map (entropy, interquartile range,
kurtosis, mean absolute deviation, minimum, robust mean absolute deviation, uniformity and variance)
and 17 out of the 18 attributes concerning the FF map, all with the exception of the minimum. All the
attributes that survive the Benjamini-Hochberg procedure present an AUC over 0.840, which is already
considered a very good value. These results are presented in Table 4.1.

In the pre-treatment data set in scenario 2, only the attribute kurtosis concerning the FF map
shows statistical significance after the Mann-Whitney U-Test, and it does not survive any of the multiple
comparison corrections. Just 3 attributes regarding the FF map (interquartile range, kurtosis and robust
mean absolute deviation) display an AUC value over 0.70. These results are presented in Table 4.2.

In the pre-treatment data set in scenario 3 there are 6 attributes that show statistical significance

regarding the FF map (10 percentile, 90 percentile, kurtosis, median, root mean squared and skewness)

after the Mann-Whitney U-Test. Nevertheless, none of the attributes’ p-value remains under the threshold
of 0.05 when submitted to the multiple comparison corrections. The attributes that show statistical

significance after the Mann-Whitney U-Test are the only ones to display an AUC value over 0.70. These

results are presented in Table 4.3.

In the post-1stcycle data set in scenario 1 there are 5 attributes that show statistical significance

regarding the ADC map (interquartile range, mean absolute deviation, minimum, robust mean absolute

deviation and variance) and there is 1 attribute that shows statistical significance concerning the FF map

(skewness) after the Mann-Whitney U-Test. Nevertheless, none of the attributes’ p-value remains under

the threshold of 0.05 when submitted to the multiple comparison corrections. A considerable amount

of attributes displays an AUC over 0.70, 10 attributes regarding the ADC map (10 percentile, entropy,

interquartile range, kurtosis, mean absolute deviation, minimum, range, robust mean absolute deviation,

uniformity and variance) and 10 attributes concerning the FF map (10 percentile, 90 percentile, energy,

entropy, kurtosis, mean, median, root mean squared, skewness, total energy and uniformity). These

results are presented in Table 4.4.

In the post-1stcycle data set in scenario 2, only the attribute kurtosis concerning the FF map shows
statistical significance after the Mann-Whitney U-Test, and it does not survive any of the multiple
comparison corrections. This attribute is also the only one to display an AUC value over 0.70. These results are
presented in Table 4.5.

In the post-1stcycle data set in scenario 3 there are 8 attributes that show statistical significance

regarding the FF map (10 percentile, 90 percentile, energy, mean absolute deviation, median, root

mean squared, skewness and total energy) after the Mann-Whitney U-Test. Nevertheless, none of

the attributes’ p-value remains under the threshold of 0.05 when submitted to the multiple comparison

corrections. An AUC over 0.70 is displayed by 10 attributes in the FF map (10 percentile, 90 percentile,

energy, kurtosis, maximum, mean absolute deviation, median, root mean squared, skewness and total

energy). These results are presented in Table 4.6.

In the delta data set in scenario 1, no attribute shows statistical significance after the
Mann-Whitney U-Test. Only 4 attributes display an AUC value over 0.70: 1 regarding
the ADC map (variance) and 3 concerning the FF map (energy, mean absolute deviation and
robust mean absolute deviation). These results are presented in Table 4.7.

In the delta data set in scenario 2 there are 5 attributes that display statistical significance regarding

the ADC map (interquartile range, mean absolute deviation, range, robust mean absolute deviation and

variance) and 5 attributes are considered statistically significant concerning the FF map (10 percentile,

90 percentile, mean, median and root mean squared) after the Mann-Whitney U-Test. Nevertheless,

none of the attributes’ p-value remains under the threshold of 0.05 when submitted to the multiple com-

parison corrections. Some attributes display an AUC over 0.70, 8 attributes regarding the ADC map (90

percentile, entropy, interquartile range, maximum, mean absolute deviation, range, robust mean abso-

lute deviation and variance) and 6 concerning the FF map (10 percentile, 90 percentile, mean, median,

root mean squared and skewness). These results are presented in Table 4.8.

In the delta data set in scenario 3 there are 5 attributes that display statistical significance regarding

the ADC map (90 percentile, interquartile range, mean, robust mean absolute deviation and variance)
and 6 attributes are considered statistically significant concerning the FF map (10 percentile, 90 percentile,
mean absolute deviation, median, root mean squared and skewness) after the Mann-Whitney

U-Test. Nevertheless, none of the attributes’ p-value remains under the threshold of 0.05 when submitted

to the multiple comparison corrections. Some attributes display an AUC over 0.70, 9 attributes regarding

the ADC map (90 percentile, entropy, interquartile range, maximum, mean, range, robust mean absolute

deviation, root mean squared and variance) and 6 concerning the FF map (10 percentile, 90 percentile,

mean absolute deviation, median, root mean squared and skewness). These results are presented in

Table 4.9.

Tables 4.1 to 4.9 illustrate the importance of the multiple comparison corrections. As expected, the number of attributes with statistical significance always drops markedly once the corrections are applied. Of the 216 attributes evaluated, 35 display statistical significance after the Mann-Whitney U-Test and only 10 survive the Benjamini-Hochberg procedure. None of the attributes evaluated survives the more conservative Bonferroni and Holm corrections.
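The three corrections compared above can be sketched in a few lines. The snippet below is only an illustrative sketch, not the RStudio code used in this work; the function names are hypothetical and the input is assumed to be a plain list of raw p-values from one family of tests.

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank is 0-based
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)  # keep adjusted values monotone
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up: the k-th smallest p-value is multiplied by m / k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending p
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Bonferroni and Holm control the family-wise error rate, whereas Benjamini-Hochberg controls the false discovery rate, which is why it is the least conservative of the three. As a check of the arithmetic, with m = 23 tests a raw p-value of 0.005 becomes 0.115 under Bonferroni, in line (up to rounding of the raw value) with the 0.118 reported for LDH in Table 4.12.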

It is interesting to observe that all the attributes that survive the Benjamini-Hochberg procedure are first-order statistics belonging to the pre-treatment data sets in scenario 1, and that the vast majority were extracted from the FF map.


Table 4.1: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.216      1.000       1.000  0.277  0.667 (0.449, 0.884)
ADC  90 Percentile        0.554      1.000       1.000  0.664  0.580 (0.276, 0.883)
ADC  Energy               0.788      1.000       1.000  0.810  0.464 (0.152, 0.775)
ADC  Entropy              0.053      1.000       0.947  0.084  0.761 (0.463, 1.000)
ADC  Interquartile Range  0.046      1.000       0.927  0.084  0.768 (0.495, 1.000)
ADC  Kurtosis             0.008      0.300       0.283  0.042  0.855 (0.679, 1.000)
ADC  Maximum              0.146      1.000       1.000  0.195  0.696 (0.407, 0.984)
ADC  Mean                 0.788      1.000       1.000  0.810  0.536 (0.252, 0.820)
ADC  MAD                  0.046      1.000       0.927  0.084  0.768 (0.482, 1.000)
ADC  Median               0.788      1.000       1.000  0.810  0.536 (0.252, 0.820)
ADC  Minimum              0.036      1.000       0.858  0.084  0.783 (0.555, 1.000)
ADC  Range                0.146      1.000       1.000  0.195  0.696 (0.402, 0.989)
ADC  Robust MAD           0.053      1.000       0.947  0.084  0.761 (0.475, 1.000)
ADC  RMS                  0.914      1.000       1.000  0.914  0.486 (0.192, 0.779)
ADC  Skewness             0.667      1.000       1.000  0.774  0.558 (0.321, 0.795)
ADC  Total Energy         0.747      1.000       1.000  0.810  0.457 (0.151, 0.763)
ADC  Uniformity           0.053      1.000       0.947  0.084  0.761 (0.488, 1.000)
ADC  Variance             0.053      1.000       0.947  0.084  0.761 (0.463, 1.000)


Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.218      1.000       1.000  0.561  0.652 (0.411, 0.892)
ADC  90 Percentile        0.758      1.000       1.000  0.758  0.538 (0.273, 0.803)
ADC  Energy               0.498      1.000       1.000  0.750  0.583 (0.323, 0.844)
ADC  Entropy              0.622      1.000       1.000  0.750  0.561 (0.295, 0.827)
ADC  Interquartile Range  0.644      1.000       1.000  0.750  0.557 (0.295, 0.819)
ADC  Kurtosis             0.758      1.000       1.000  0.758  0.538 (0.270, 0.805)
ADC  Maximum              0.538      1.000       1.000  0.750  0.576 (0.311, 0.841)
ADC  Mean                 0.538      1.000       1.000  0.750  0.576 (0.313, 0.839)
ADC  MAD                  0.389      1.000       1.000  0.750  0.606 (0.353, 0.859)
ADC  Median               0.424      1.000       1.000  0.750  0.598 (0.346, 0.852)
ADC  Minimum              0.735      1.000       1.000  0.758  0.542 (0.396, 0.688)
ADC  Range                0.667      1.000       1.000  0.750  0.553 (0.286, 0.820)
ADC  Robust MAD           0.667      1.000       1.000  0.750  0.553 (0.290, 0.816)
ADC  RMS                  0.460      1.000       1.000  0.750  0.591 (0.337, 0.845)
ADC  Skewness             0.218      1.000       1.000  0.561  0.652 (0.410, 0.893)
ADC  Total Energy         0.667      1.000       1.000  0.750  0.553 (0.296, 0.810)
ADC  Uniformity           0.622      1.000       1.000  0.750  0.561 (0.289, 0.832)
ADC  Variance             0.622      1.000       1.000  0.750  0.561 (0.298, 0.823)
FF   10 Percentile        0.039      1.000       1.000  0.251  0.754 (0.549, 0.959)
FF   90 Percentile        0.025      0.888       0.888  0.251  0.777 (0.555, 0.998)
FF   Energy               0.124      1.000       1.000  0.496  0.689 (0.449, 0.930)
FF   Entropy              0.424      1.000       1.000  0.750  0.598 (0.351, 0.846)
FF   Interquartile Range  0.186      1.000       1.000  0.557  0.663 (0.435, 0.891)
FF   Kurtosis             0.036      1.000       1.000  0.251  0.758 (0.548, 0.967)
FF   Maximum              0.601      1.000       1.000  0.750  0.564 (0.375, 0.754)
FF   Mean                 0.242      1.000       1.000  0.581  0.644 (0.408, 0.880)
FF   MAD                  0.049      1.000       1.000  0.251  0.742 (0.510, 0.975)
FF   Median               0.034      1.000       1.000  0.251  0.761 (0.523, 0.999)
FF   Minimum              0.148      1.000       1.000  0.514  0.678 (0.474, 0.882)
FF   Range                0.498      1.000       1.000  0.750  0.583 (0.343, 0.823)
FF   Robust MAD           0.157      1.000       1.000  0.514  0.674 (0.447, 0.901)
FF   RMS                  0.031      1.000       1.000  0.251  0.765 (0.541, 0.989)
FF   Skewness             0.042      1.000       1.000  0.251  0.750 (0.511, 0.989)
FF   Total Energy         0.085      1.000       1.000  0.382  0.712 (0.478, 0.946)
FF   Uniformity           0.758      1.000       1.000  0.758  0.538 (0.283, 0.793)
FF   Variance             0.295      1.000       1.000  0.665  0.629 (0.390, 0.867)


Table 4.4: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.



ADC


FF

Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
10 Percentile        0.110      1.000       1.000  0.222  0.737 (0.340, 1.000)
90 Percentile        0.095      1.000       1.000  0.222  0.747 (0.367, 1.000)
Energy               0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
Entropy              0.126      1.000       1.000  0.240  0.726 (0.349, 1.000)
Interquartile Range  0.546      1.000       1.000  0.702  0.589 (0.179, 1.000)
Kurtosis             0.145      1.000       1.000  0.249  0.716 (0.373, 1.000)
Maximum              0.227      1.000       1.000  0.355  0.679 (0.274, 1.000)
Mean                 0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
MAD                  0.455      1.000       1.000  0.631  0.611 (0.195, 1.000)
Median               0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
Minimum              0.972      1.000       1.000  0.972  0.505 (0.234, 0.755)
Range                0.374      1.000       1.000  0.561  0.632 (0.219, 1.000)
Robust MAD           0.696      1.000       1.000  0.835  0.558 (0.123, 0.993)
RMS                  0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
Skewness             0.017      0.621       0.621  0.222  0.853 (0.654, 1.000)
Total Energy         0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
Uniformity           0.110      1.000       1.000  0.222  0.737 (0.363, 1.000)
Variance             0.455      1.000       1.000  0.631  0.611 (0.178, 1.000)





ADC


FF






Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
10 Percentile        0.573      1.000       1.000  0.688  0.573 (0.301, 0.844)
90 Percentile        0.149      1.000       1.000  0.440  0.686 (0.437, 0.935)
Energy               0.573      1.000       1.000  0.688  0.573 (0.306, 0.840)
Entropy              0.439      1.000       1.000  0.688  0.600 (0.325, 0.875)
Interquartile Range  0.460      1.000       1.000  0.688  0.595 (0.311, 0.880)
Kurtosis             0.673      1.000       1.000  0.734  0.445 (0.186, 0.705)
Maximum              0.291      1.000       1.000  0.551  0.636 (0.365, 0.907)
Mean                 0.526      1.000       1.000  0.688  0.582 (0.297, 0.867)
MAD                  0.205      1.000       1.000  0.492  0.664 (0.406, 0.921)
Median               0.260      1.000       1.000  0.550  0.645 (0.379, 0.912)
Minimum              1.000      1.000       1.000  1.000  0.500 (0.500, 0.500)
Range                0.291      1.000       1.000  0.551  0.636 (0.365, 0.907)
Robust MAD           0.481      1.000       1.000  0.688  0.591 (0.304, 0.878)
RMS                  0.159      1.000       1.000  0.440  0.682 (0.423, 0.941)
Skewness             0.439      1.000       1.000  0.688  0.600 (0.334, 0.867)
Total Energy         0.573      1.000       1.000  0.688  0.573 (0.298, 0.847)
Uniformity           0.481      1.000       1.000  0.688  0.591 (0.309, 0.873)

ADC


FF

Variance 0.624 1.000 1.000 0.702 0.433 (0.143, 0.724)


Table 4.7: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.



ADC


FF






ADC


FF






Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
10 Percentile        0.324      1.000       1.000  0.556  0.627 (0.364, 0.891)
90 Percentile        0.035      1.000       1.000  0.139  0.773 (0.549, 0.997)
Energy               0.888      1.000       1.000  0.913  0.518 (0.252, 0.784)
Entropy              0.067      1.000       1.000  0.173  0.736 (0.501, 0.972)
Interquartile Range  0.049      1.000       1.000  0.159  0.755 (0.534, 0.976)
Kurtosis             0.481      1.000       1.000  0.619  0.591 (0.321, 0.861)
Maximum              0.067      1.000       1.000  0.173  0.736 (0.507, 0.966)
Mean                 0.029      1.000       0.871  0.139  0.782 (0.549, 1.000)
MAD                  0.439      1.000       1.000  0.619  0.600 (0.336, 0.864)
Median               0.778      1.000       1.000  0.849  0.536 (0.253, 0.819)
Minimum              0.439      1.000       1.000  0.619  0.400 (0.269, 0.531)
Range                0.067      1.000       1.000  0.173  0.736 (0.507, 0.966)
Robust MAD           0.035      1.000       1.000  0.139  0.773 (0.548, 0.997)
RMS                  0.105      1.000       1.000  0.253  0.709 (0.452, 0.967)
Skewness             0.205      1.000       1.000  0.369  0.664 (0.408, 0.919)
Total Energy         0.622      1.000       1.000  0.739  0.564 (0.300, 0.828)
Uniformity           0.139      1.000       1.000  0.313  0.691 (0.448, 0.934)

ADC


FF

Variance 0.657 1.000 1.000 0.739 0.563 (0.278, 0.847)


The AUC determined from a ROC curve is representative of how good an attribute may be. Nevertheless, the same AUC can be achieved with differently shaped curves. The points (TPR, FPR) that build the ROC curve of each attribute of interest therefore need to undergo further examination in order to find the variables that display the best balance between sensitivity and specificity. The general goal is to maximize the TPR while minimizing the FPR, which corresponds to a point close to the upper left corner of the graphic.
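One common way of locating the point closest to the upper left corner is sketched below. This is an assumption made for illustration and not necessarily the selection method of Subsection 3.3.3; the function names are hypothetical, the positive class is coded as 1, and a sample is assumed to be predicted positive when its value is at or above the threshold (for attributes where the positive class takes lower values, the comparison would be reversed).

```python
from math import hypot


def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for every distinct score used as a cut-off.

    labels: 1 = positive class, 0 = negative class.
    A sample is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / n_pos, fp / n_neg))
    return points


def closest_to_corner(points):
    """Pick the point with the smallest Euclidean distance to (FPR=0, TPR=1)."""
    return min(points, key=lambda p: hypot(p[2], 1.0 - p[1]))
```

Other criteria, such as maximizing TPR − FPR, can select a different threshold, so the choice of criterion should follow the clinical cost of each type of error.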

Table 4.10 presents the attributes considered to have the best balance between TPR and FPR for the problem at hand (obtained by the method briefly described in Subsection 3.3.3), the threshold at which these rates occur, and other performance metrics associated with each attribute (accuracy, precision and F1-measure).

It is important to mention that only attributes that remained statistically significant after the multiple comparison corrections are presented in this section. The imaging features found interesting when considering the ROC curve alone are displayed in Appendix D.1 along with their associated performance metrics.

The attributes depicted in Table 4.10 will be referred to as key attributes from this point on. The ROC curve associated with each key attribute is available in Appendix D.2. The second part of Table 4.10 contains the AUC values, with the respective 95% confidence interval, and the p-values before and after the multiple comparison corrections for each key attribute.

Table 4.10: Summary of the attributes considered statistically relevant through the p-value evaluation and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.

Data set      Scenario  Attribute      TPR    FPR    Threshold    Accuracy  Precision  F1
ADC PreTreat  1         Kurtosis       0.667  0.087  3.30         0.862     0.667      0.667
FF PreTreat   1         90 Percentile  0.667  0.091  19.0         0.857     0.667      0.667
FF PreTreat   1         Median         0.833  0.182  11.0         0.821     0.556      0.667
FF PreTreat   1         RMS            0.833  0.182  17.4         0.821     0.556      0.667
FF PreTreat   1         Skewness       0.833  0.182  0.624        0.821     0.556      0.667
FF PreTreat   1         Total Energy   0.833  0.182  2.72 × 10^7  0.821     0.556      0.667

Data set      Scenario  Attribute      AUC (95% CI)          MW U-Test  Bonferroni  Holm   BH
ADC PreTreat  1         Kurtosis       0.855 (0.679, 1.000)  0.008      0.300       0.283  0.042
FF PreTreat   1         90 Percentile  0.879 (0.747, 1.000)  0.005      0.184       0.184  0.042
FF PreTreat   1         Median         0.856 (0.698, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         RMS            0.856 (0.704, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         Skewness       0.856 (0.702, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         Total Energy   0.864 (0.703, 1.000)  0.007      0.259       0.252  0.042

One may observe in Table 4.10 that two attributes have the same TPR (0.667) but distinct FPR. The attribute kurtosis from the ADC map displays a slightly lower FPR than the attribute 90 percentile from the FF map (a difference of 0.004). Nonetheless, the attribute 90 percentile displays one of the best AUC values among all the attributes evaluated, particularly when considering the confidence interval. With these aspects considered, it was decided by the author to further explore these two key attributes.

The remaining 8 attributes that survived the Benjamini-Hochberg correction all presented a threshold at which the TPR is equal to 0.833 and the FPR is equal to 0.182. The criteria that led to the selection of the 4 attributes with this TPR and FPR depicted in Table 4.10 were the AUC value and the p-value after the Mann-Whitney U-Test. The attributes selected have an AUC over 0.850, with a confidence interval ranging from approximately 0.70 to 1.00, and an associated p-value under 0.01.

Through Table 4.10 it is possible to see that the increase of the TPR is associated with the increase

in the FPR.

Taken together, the p-values and the ROC curves allow the inference of which attributes and which data sets could be more robust in distinguishing the two classes (responders vs non-responders).

Since all the key attributes belong to scenario 1, the initial main goal would be to minimize the FPR; in other words, to avoid the misclassification of a responsive patient as a non-responder, as this mistake could lead to the interruption of a working treatment. The non-responders class is composed of 6 patients in both data sets, while the responders class is composed of 22 or 23 patients, depending on whether one is considering the FF or the ADC data set, respectively. The TPR values of 0.833 and 0.667 correspond to the misclassification of 1 and 2 patients, respectively, who belong to the non-responders class, while the FPR values of 0.087 or 0.091 and of 0.182 correspond to the misclassification of 2 and 4 patients, respectively, who belong to the responders population. What one should consider is whether misclassifying one more patient as a responder is better than misclassifying two more patients as non-responders. In the first case, one more patient may undergo unnecessary debilitating procedures, whereas in the second case two patients would have their treatment adjusted, which could lead to a step back in the fight against the disease. The other performance metrics (accuracy, precision and F1-measure) show better results in the first case (TPR equal to 0.667).
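The patient counts quoted above follow directly from the rates and the class sizes, and the remaining metrics of Table 4.10 can be recomputed from them. The snippet below is a small sketch of that arithmetic; the function names are hypothetical, and the positive class is taken to be the non-responders, as implied by the discussion of the TPR above.

```python
def confusion_from_rates(tpr, fpr, n_pos, n_neg):
    """Recover integer confusion-matrix counts from rounded rates and class sizes."""
    tp = round(tpr * n_pos)
    fp = round(fpr * n_neg)
    return {"TP": tp, "FN": n_pos - tp, "FP": fp, "TN": n_neg - fp}


def metrics(c):
    """Return (accuracy, precision, F1) for a confusion-matrix dictionary."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    accuracy = (c["TP"] + c["TN"]) / total
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, f1


# Positive class = non-responders (6 patients); negative class = responders.
kurtosis_adc = confusion_from_rates(0.667, 0.087, n_pos=6, n_neg=23)
median_ff = confusion_from_rates(0.833, 0.182, n_pos=6, n_neg=22)
```

For the ADC kurtosis this gives 2 missed non-responders and 2 misclassified responders, and for the FF median 1 missed non-responder and 4 misclassified responders; the resulting accuracy, precision and F1 values round to those listed in Table 4.10.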

4.1.2 Detailed Analysis of the Key Attributes

The patients are divided between responders and non-responders. The evaluation of class separation considering the median of each population is reflected in the p-values attained with the Mann-Whitney U-Test. For the key attributes, Table 4.11 explicitly shows the mean and standard deviation of each group, as well as the range of values taken by each variable.

It is expected, considering the preceding analysis in Subsection 4.1.1, that these attributes display a good separation between the two populations. Table 4.11 gives an idea of the distance between the means of each population, as well as a notion of class dispersion when the standard deviation and the range are considered.

A well known method for graphically depicting the distribution of groups of numerical data through their quartiles is the box and whisker plot. It facilitates the visualization of the separation (location) of the class values, and it may also enable the evaluation, within each individual population, of the existence of outliers and of data dispersion, for instance. The box and whisker plots for the key attributes are depicted in Figure 4.1.

Table 4.11: Mean associated with the key attributes for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each key attribute is located.

Data set      Key attribute  Responders (mean ± SD)    Non-responders (mean ± SD)  Range
ADC PreTreat  Kurtosis       2.82 ± 0.34               3.50 ± 0.62                 (2.12, 4.56)
FF PreTreat   90 Percentile  38.3 ± 11.4               19.8 ± 12.3                 (5.0, 47.0)
FF PreTreat   Median         27.1 ± 13.7               6.8 ± 11.0                  (0, 41)
FF PreTreat   RMS            28.7 ± 10.7               17.7 ± 8.9                  (4.12, 40.43)
FF PreTreat   Skewness       −0.22 ± 1.07              1.60 ± 1.21                 (−1.22, 3.47)
FF PreTreat   Total Energy   1.40 × 10^8 ± 9.43 × 10^7  3.30 × 10^7 ± 5.34 × 10^7   (1.49 × 10^6, 2.49 × 10^8)

The box represents the observations between the lower quartile (Q1, 25th percentile) and the upper quartile (Q3, 75th percentile); this span is denominated the interquartile range (IQR = Q3 − Q1). The median (Q2, 50th percentile) is traced inside it. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations, and the lower extreme as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations. Observations outside the whiskers are considered outliers.
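The definitions above translate into a short computation. The sketch below follows the whisker rule exactly as stated in the text; the function name is hypothetical, and the quartile interpolation rule (Python's "inclusive" method) is an assumption, since plotting packages differ slightly in how they compute quartiles.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles, whisker extremes and outliers as defined in the text."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Upper extreme: the smaller of Q3 + 1.5 * IQR and the largest observation.
    upper = min(q3 + 1.5 * iqr, max(values))
    # Lower extreme: the larger of Q1 - 1.5 * IQR and the smallest observation.
    lower = max(q1 - 1.5 * iqr, min(values))
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper, "outliers": outliers}
```

With small classes, such as the 6 non-responders in scenario 1, the quartiles depend noticeably on the interpolation rule, so boxes drawn by different tools may differ slightly.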

All the values for the median, upper and lower quartiles and upper and lower extremes in each case

are presented in Table D.2 (Appendix D.3).

It is important to keep in mind the discrepancy between the number of patients in each class in

order to have a better understanding of the results depicted in each box and whisker plot. Since all

the key attributes belong to scenario 1, the class responders corresponds to approximately 80% of the

patients present in this study, while the remaining 20% belong to the class non-responders, as mentioned

previously. This fact can be particularly relevant when analysing class dispersion.

As previously mentioned, the box and whisker plot is a graphic representation of the data distribution.

As a general appreciation, one may say that all the box and whisker plots appear to be in accordance

with the results previously found, which indicate that these attributes have discriminatory power when

distinguishing the two populations (responders and non-responders).

Although there is no attribute that shows a perfect separation between classes, one might observe that the key attributes seem to display a good separation. All the key attributes in scenario 1 relating to the FF map at the pre-treatment stage show an apparent absence of common values in the IQR.

Through the box and whisker plot (Figure 4.1), it appears that the classes have distinct distributions and locations when considering the key attribute 90 percentile (FF PreTreat 1). The distribution of the non-responders class is broader than that of the responders class. The interval of signal intensities from a FF map ranges from 0 to 50. There is an apparent distinct tendency toward lower observation values of this attribute in the non-responders class and toward higher observation values in the responders class.

Figure 4.1: Comparison of differences in signal intensity parameters collected from ADC and FF maps between responders and non-responders for the key attributes A Kurtosis in the ADC pre-treatment data set in scenario 1 (ADC PreTreat 1), B 90 Percentile in the FF pre-treatment data set in scenario 1 (FF PreTreat 1), C Median in the FF pre-treatment data set in scenario 1, D Skewness in the FF pre-treatment data set in scenario 1, E Root Mean Squared in the FF pre-treatment data set in scenario 1 and F Total Energy in the FF pre-treatment data set in scenario 1. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations and the lower extreme can be defined as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.

Through the box and whisker plots (Figure 4.1), it appears that the classes have distinct distributions and locations when considering individually the key attributes median, root mean squared and total energy (FF PreTreat 1). The classes in these three graphics seem to display a similar relative position, with the non-responders group tending to lower observation values, while the responders group appears to display a tendency to higher observation values, similarly to what happens with the 90 percentile attribute. In addition, the distribution in the responders group appears to be broader when compared to the non-responders group. There is an easily identifiable outlier in the non-responders group in all three plots, and it belongs to the same patient (patient 18, who was classified as having a stable disease).

Through the box and whisker plot (Figure 4.1), it appears that the classes have distinct locations when considering the key attribute skewness (FF PreTreat 1). The skewness of the non-responders class comprises mostly positive values, which is an indication of a positively skewed distribution, especially if the values are above 0.5. The skewness of the responders class comprises mostly negative values, which is an indication of a negatively skewed distribution, especially if the values are below −0.5.


4.1.3 Clinical Variables

The univariate analysis developed in this work has its main focus on the preliminary discovery of an accurate imaging biomarker for the prediction of treatment response status in an early phase of induction chemotherapy. Nevertheless, it was thought that the available clinical data from the patients in this study might also be a source of possible biomarkers.

Identically to the procedure followed when analysing the imaging features, the clinical data set was

submitted to the Mann-Whitney U-Test and to the multiple comparison corrections, as well as to the AUC

evaluation in all scenarios (1, 2 and 3). The results are depicted in Tables 4.12 to 4.14.
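The Mann-Whitney U statistic and the AUC are closely related, which the following sketch makes explicit. It is only an illustration with a hypothetical function name, not the RapidMiner or RStudio procedure used here, and it computes the statistic and the AUC only, not the p-value or the confidence interval.

```python
def mann_whitney_auc(pos, neg):
    """Return (U, AUC) where U counts the pairs in which the positive value is larger.

    Ties count as half a pair. AUC = U / (n_pos * n_neg) is the probability that a
    randomly chosen positive observation exceeds a randomly chosen negative one.
    """
    u = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                u += 1.0
            elif p == q:
                u += 0.5
    return u, u / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to no separation between the two groups. A value below 0.5 means that the positive class tends to take the lower values, so the direction in which the attribute is read matters when comparing AUC values across attributes.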

Table 4.12: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.351      1.000       1.000  0.576  0.625 (0.301, 0.949)
Weight                  0.014      0.317       0.290  0.091  0.830 (0.668, 0.992)
Height                  0.074      1.000       1.000  0.212  0.740 (0.515, 0.964)
Gender                  0.120      1.000       1.000  0.306  0.708 (0.516, 0.900)
ECOGPS                  0.736      1.000       1.000  0.770  0.545 (0.286, 0.804)
Hb                      0.016      0.366       0.318  0.091  0.823 (0.630, 1.000)
WBC                     0.337      1.000       1.000  0.576  0.628 (0.390, 0.867)
PMNs                    0.517      1.000       1.000  0.646  0.587 (0.299, 0.875)
Platelet                0.378      1.000       1.000  0.580  0.618 (0.280, 0.956)
Albumin                 0.452      1.000       1.000  0.633  0.601 (0.309, 0.893)
Creatinine              0.641      1.000       1.000  0.702  0.563 (0.277, 0.849)
Urea                    0.534      1.000       1.000  0.646  0.583 (0.352, 0.815)
Calcium                 0.641      1.000       1.000  0.702  0.563 (0.263, 0.862)
Alkaline Phosphatase    0.468      1.000       1.000  0.633  0.597 (0.291, 0.903)
β2M                     0.023      0.518       0.428  0.104  0.806 (0.579, 1.000)
LDH                     0.005      0.118       0.113  0.059  0.906 (0.718, 1.000)
Actual Percentage       0.058      1.000       0.993  0.192  0.777 (0.587, 0.967)
Serum Peak              0.325      1.000       1.000  0.576  0.632 (0.337, 0.927)
Serum Immunoglobulin A  0.001      0.030       0.030  0.030  0.452 (0.200, 0.705)
Serum Kappa Free        0.325      1.000       1.000  0.509  0.569 (0.275, 0.863)
Serum Lambda Free       0.325      1.000       1.000  1.000  0.623 (0.328, 0.918)
Ratio κλ                0.325      1.000       1.000  0.428  0.602 (0.306, 0.899)
ISS                     0.038      0.876       0.686  0.146  0.778 (0.612, 0.944)

In the clinical data, 6 attributes show statistical significance in scenario 1 (weight, Hb, β2M, LDH, serum immunoglobulin A and ISS), 1 attribute in scenario 2 (weight) and 1 attribute in scenario 3 (serum lambda free) after the Mann-Whitney U-Test. Only the attribute serum immunoglobulin A remains statistically significant after the multiple comparison corrections. An AUC over 0.70 is displayed by 7 attributes in scenario 1 (weight, height, gender, Hb, β2M, LDH and ISS), 1 in scenario 2 (weight) and 3 in scenario 3 (serum kappa free, serum lambda free and ratio κλ). It is worth mentioning that the only attribute that survives the multiple comparison corrections does not have an AUC over 0.70; in fact, serum immunoglobulin A displays a low AUC value of 0.452.

Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.525      1.000       1.000  0.837  0.431 (0.208, 0.653)
Weight                  0.044      1.000       1.000  0.653  0.720 (0.5334, 0.9064)
Height                  0.472      1.000       1.000  0.837  0.421 (0.200, 0.642)
Gender                  0.525      1.000       1.000  0.837  0.569 (0.382, 0.757)
ECOGPS                  0.933      1.000       1.000  0.933  0.491 (0.287, 0.694)
Hb                      0.498      1.000       1.000  0.837  0.574 (0.353, 0.795)
WBC                     0.933      1.000       1.000  0.933  0.509 (0.284, 0.735)
PMNs                    0.719      1.000       1.000  0.933  0.539 (0.323, 0.756)
Platelet                0.320      1.000       1.000  0.837  0.609 (0.389, 0.828)
Albumin                 0.525      1.000       1.000  0.837  0.569 (0.344, 0.795)
Creatinine              0.446      1.000       1.000  0.837  0.583 (0.363, 0.804)
Urea                    0.421      1.000       1.000  0.837  0.588 (0.373, 0.803)
Calcium                 0.882      1.000       1.000  0.933  0.484 (0.251, 0.716)
Alkaline Phosphatase    0.849      1.000       1.000  0.933  0.521 (0.303, 0.739)
β2M                     0.446      1.000       1.000  0.837  0.583 (0.370, 0.797)
LDH                     0.899      1.000       1.000  0.933  0.444 (0.222, 0.667)
Actual Percentage       0.899      1.000       1.000  0.933  0.550 (0.306, 0.794)
Serum Peak              0.568      1.000       1.000  0.837  0.563 (0.343, 0.782)
Serum Immunoglobulin A  0.525      1.000       1.000  0.837  0.587 (0.340, 0.834)
Serum Kappa Free        0.352      1.000       1.000  0.837  0.646 (0.440, 0.853)
Serum Lambda Free       0.057      1.000       1.000  0.653  0.672 (0.469, 0.875)
Ratio κλ                0.409      1.000       1.000  0.837  0.671 (0.467, 0.875)
ISS                     0.582      1.000       1.000  0.837  0.560 (0.354, 0.766)

From the attributes in the clinical data set, all of those with an AUC over 0.70 were evaluated in order to find the ones that display the best balance between TPR and FPR. Table 4.15 summarizes, similarly to what was done for the imaging features, the p-values and AUC values, as well as all the statistical parameters of interest, such as the true positive rate, false positive rate, threshold, accuracy, precision and F1-measure. In this particular case, the attribute LDH displayed a better performance than any other attribute.

It is worth mentioning that, although the attribute LDH displays very good performance metrics, it does not survive the multiple comparison corrections; there is therefore a high probability that it showed statistical significance after the Mann-Whitney U-Test only by chance.

For both attributes, LDH and serum immunoglobulin A, Table 4.16 presents the mean and standard deviation of each class, as well as the range of values taken by each variable, and Figure 4.2 displays the box and whisker plots that depict the separation of both populations. All the values for the median, upper and lower quartiles and upper and lower extremes in each case are presented in Table D.2 (Appendix D.3).

Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.840      1.000       1.000  0.997  0.476 (0.228, 0.724)
Weight                  0.260      1.000       1.000  0.997  0.635 (0.403, 0.868)
Height                  0.885      1.000       1.000  0.997  0.483 (0.237, 0.728)
Gender                  1.000      1.000       1.000  1.000  0.500 (0.294, 0.706)
ECOGPS                  0.954      1.000       1.000  0.997  0.507 (0.284, 0.730)
Hb                      0.729      1.000       1.000  0.997  0.542 (0.293, 0.790)
WBC                     0.908      1.000       1.000  0.997  0.514 (0.266, 0.762)
PMNs                    0.470      1.000       1.000  0.997  0.587 (0.348, 0.825)
Platelet                0.141      1.000       1.000  0.810  0.677 (0.446, 0.909)
Albumin                 0.665      1.000       1.000  0.997  0.552 (0.307, 0.798)
Creatinine              0.488      1.000       1.000  0.997  0.583 (0.340, 0.826)
Urea                    0.273      1.000       1.000  0.997  0.632 (0.385, 0.879)
Calcium                 0.840      1.000       1.000  0.997  0.476 (0.226, 0.726)
Alkaline Phosphatase    0.840      1.000       1.000  0.997  0.524 (0.276, 0.772)
β2M                     0.817      1.000       1.000  0.997  0.528 (0.276, 0.779)
LDH                     0.931      1.000       1.000  0.997  0.610 (0.357, 0.863)
Actual Percentage       0.544      1.000       1.000  0.997  0.454 (0.193, 0.716)
Serum Peak              0.840      1.000       1.000  0.997  0.524 (0.277, 0.772)
Serum Immunoglobulin A  0.908      1.000       1.000  0.997  0.582 (0.326, 0.837)
Serum Kappa Free        0.119      1.000       1.000  0.810  0.705 (0.481, 0.928)
Serum Lambda Free       0.006      0.128       0.128  0.128  0.765 (0.566, 0.965)
Ratio κλ                0.126      1.000       1.000  0.810  0.764 (0.556, 0.973)
ISS                     0.686      1.000       1.000  0.997  0.652 (0.437, 0.867)

Both the data presented in Table 4.16 and the box and whisker plot presented in Figure 4.2A seem to support a good class division for the attribute LDH. The LDH observations range between 100 and 277. In the box and whisker plot, the two populations present distinct distributions and locations. The non-responders class appears to comprise overall higher values and a more dispersed collection of observations when compared to the responders class. One may say that there exists a significant separation between the classes. Nevertheless, it is important to keep in mind that the non-responders class is constituted by 6 patients, while the responders class is composed of 23 patients. The class imbalance may contribute to data artifacts.

As can be seen in the box and whisker plot that depicts the attribute serum immunoglobulin A (Figure 4.2B), the responders class contains a considerable number of outliers. In order to better understand the information present in the box and whisker plot, the graphic was amplified (Figure 4.3).

Table 4.15: Summary of the attributes considered statistically relevant through the Mann-Whitney U-Test and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC score with the respective 95% confidence interval and p-values before and after the multiple comparison tests.

Data set  Scenario  Attribute  TPR    FPR  Threshold  Accuracy  Precision  F1
Clinical  1         LDH        0.833  0    195.5      0.966     1.000      0.909

Data set  Scenario  Attribute  AUC (95% CI)          MW U-Test  Bonferroni  Holm   BH
Clinical  1         LDH        0.906 (0.718, 1.000)  0.005      0.118       0.113  0.059

Table 4.16: Mean associated with the clinical attributes LDH and serum immunoglobulin A (Serum IgA) for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each attribute is located.

Attribute  Responders (mean ± SD)  Non-responders (mean ± SD)  Range
LDH        151 ± 23                224 ± 45                    100 − 277
Serum IgA  34 ± 8                  226 ± 670                   23 − 3180

Figure 4.2: Box and whisker plot for the clinical attributes A LDH and B Serum Immunoglobulin A (Serum IgA) in scenario 1. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations and the lower extreme can be defined as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.

It is not possible to visualize a good class separation for the serum immunoglobulin A in the am-

plified box and whiskers plot (Figure 4.3). The good results obtained in the p-value evaluation may be

influenced by the presence of outliers in the class responders (3 outliers), by the lack of data regarding

5 patients and by the class imbalance that characterizes the data set in this scenario (only 16% of the

patients are non-responders, 4 out of 25).


Figure 4.3: Box and whisker plot for the clinical attribute Serum Immunoglobulin A (Serum IgA) in scenario 1, amplified. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations and the lower extreme can be defined as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.

4.2 Density Plots Comparative Analysis Over the First Round of Chemotherapy

The classification of the response to treatment among patients ranges from 1 to 6, with distinct levels of response. One may expect these patients to show different patterns in the data collected according to their response to treatment. The hypothesis is that a patient with complete response to treatment or progressive disease would show a greater change in the appearance of the spinal vertebrae in the MRI images than a patient with partial response or stable disease. In other words, an increased variation in the signal intensity extracted from the ADC and FF maps would be registered in cases with a more significant change in the spine condition.

The probability density function specifies for a continuous random variable the distribution of the

density over the range of values taken by the variable. This function can be graphically depicted and it

constitutes an accurate representation of the distribution of the numerical data.
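In practice the density is estimated from the sampled signal intensities. A Gaussian kernel density estimate is one common choice and is sketched below purely as an illustration; it is an assumption that a comparable estimator underlies the density plots of this section, and both the function name and the `bandwidth` parameter are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using Gaussian kernels.

    Each sample contributes a normal curve of standard deviation `bandwidth`;
    the curves are averaged so that the estimate integrates to 1.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, so two overlaid densities are best compared using the same bandwidth.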

The density plots displayed in Figures 4.4 through 4.9 overlay the signal intensity distributions collected from the ADC and FF maps before and after the first round of chemotherapy for 3 different patients (2, 6 and 22), each with a different classification of the response to treatment according to the IMWG guidelines (6, 4 and 1, respectively).

The characteristics that describe the distribution of the signal intensities measured directly from the parametric maps at the two moments of image acquisition are summarized in Table 4.17.

From Table 4.17 it is possible to observe that the kurtosis for patients 2 and 6 is always higher than 3 (5.15 to 17.8), while the kurtosis for patient 22 is always below 3 (1.97 to 2.66). A kurtosis equal to 3 corresponds to a normal distribution of the data, as mentioned previously in section 2.4. These results indicate that the signal distributions from patients 2 and 6 present longer and heavier tails than a normal distribution, which reflects the dispersion of the data; the kurtosis value increases with the number of outliers. On the other hand, the signals collected from patient 22 display a distribution with shorter and thinner tails than a normal distribution.

Table 4.17: Summary of the statistical metrics that characterize the density plots depicted in Figures 4.4 to 4.9, namely: mean ± standard deviation (SD), kurtosis (kurt), skewness (skew), median, minimum (min) and maximum (max); with the respective patient's identification, the map and phase of treatment to which the evaluated signal intensities correspond, the classification of the response to treatment and the Figure (fig) identification of the respective density plot.

Patient ID  Map  Treatment phase  Mean ± SD    Kurt  Skew    Median  Min  Max   Response to treatment  Fig
2           ADC  Pre              295 ± 86     9.27  1.65    281     0    962   6                      4.4
2           ADC  Post             340 ± 114    5.44  1.29    316     0    953   6                      4.4
2           FF   Pre              8.25 ± 9.63  6.35  1.83    6.00    0    50.0  6                      4.5
2           FF   Post             3.69 ± 6.19  17.8  3.43    2.00    0    50.0  6                      4.5
6           ADC  Pre              189 ± 149    5.15  1.24    168     0    975   4                      4.6
6           ADC  Post             147 ± 162    6.96  1.93    108     0    886   4                      4.6
6           FF   Pre              38.0 ± 8.2   6.45  −1.62   40.0    0    50.0  4                      4.7
6           FF   Post             37.3 ± 8.3   5.99  −1.46   39.0    0    50.0  4                      4.7
22          ADC  Pre              682 ± 347    2.66  0.198   664     0    1999  1                      4.8
22          ADC  Post             684 ± 402    2.46  0.379   634     0    2030  1                      4.8
22          FF   Pre              23.0 ± 14.1  1.97  −0.242  25.0    0    50.0  1                      4.9
22          FF   Post             27.1 ± 14.2  2.22  −0.584  30.0    0    50.0  1                      4.9

The skewness is always greater than 1 in absolute value for patients 2 and 6, which reflects a highly skewed distribution. For patient 2 the skewness is positive in both the ADC and FF maps, while for patient 6 the skewness is positive for the ADC map and negative for the FF map. A positive or negative skewness reflects the concentration of the signal intensity values to the left or to the right, respectively. The skewness of the data collected from patient 22 is always below 1 in absolute value and in the majority of the cases under 0.5, which indicates a moderate to non-significant asymmetry in the distribution of the signal intensity data collected.
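The two shape descriptors discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the population (biased) moment definitions, with kurtosis in its non-excess form so that a normal distribution gives 3, as in the text; the exact estimator used to produce Table 4.17 is not restated here, and the sample lists are hypothetical.

```python
def central_moment(values, order):
    """k-th central moment using the population (1/n) normalisation."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Skewness m3 / m2^(3/2): positive when the long tail is on the right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Non-excess kurtosis m4 / m2^2: equal to 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical samples (not data from this study).
symmetric = [1, 2, 3, 4, 5]              # no asymmetry, light tails
right_tailed = [1, 1, 1, 2, 2, 3, 10]    # values concentrated on the left, one outlier

for name, sample in (("symmetric", symmetric), ("right_tailed", right_tailed)):
    print(f"{name}: skew = {skewness(sample):.3f}, kurt = {kurtosis(sample):.3f}")
```

The second sample illustrates the pattern described for patients 2 and 6: a single far outlier is enough to push the skewness above 1 and the kurtosis above 3.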

The range of the signal intensity values differs considerably between the ADC and FF maps. The FF maps have intensities ranging from 0 to 50 in all cases, while the values in the ADC maps lie between 0 and approximately 1000 for patients 2 and 6 and between 0 and approximately 2000 for patient 22. Considering only the FF maps, patients 2 and 22 present a similar signal intensity evolution in absolute value from the pre-treatment to the post-1st cycle moment; both the mean and the median vary by between 4 and 5 units (an 8% to 10% alteration in signal intensity), which may be considered relevant given the range of intensities in these maps (0 to 50). The alterations for patient 6 in terms of the mean and median from the pre-treatment to the post-1st cycle moment are less evident (between 0.7 and 1 units, which corresponds to a variation of at most 2% of the range). Considering only the ADC maps, patients 2 and 6 have a similar evolution, in absolute value, from the pre-treatment to the post-1st cycle moment: both the mean and the median show a variation between 35 and 60 units (a 3.5% to 6.0% alteration in signal intensity), which might be considered relevant given the range of intensities (0 to 1000). The alterations for patient 22 in terms of the mean and median from the pre-treatment to the post-1st cycle moment are much less evident (between 2 and 30 units, which corresponds to a variation of at most 1.5%) given the range of signal intensities (0 to 2000).
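The arithmetic behind this comparison can be reproduced directly from Table 4.17. The sketch below assumes that each percentage is the absolute pre-to-post change divided by the approximate intensity range quoted in the text (50 for FF, 1000 or 2000 for ADC); the dictionary layout and function name are illustrative choices.

```python
# Values transcribed from Table 4.17: (pre, post) pairs, plus the approximate
# intensity range quoted in the text for each patient and map.
table = {
    (2, "FF"): {"mean": (8.25, 3.69), "median": (6.00, 2.00), "range": 50},
    (6, "FF"): {"mean": (38.0, 37.3), "median": (40.0, 39.0), "range": 50},
    (22, "FF"): {"mean": (23.0, 27.1), "median": (25.0, 30.0), "range": 50},
    (2, "ADC"): {"mean": (295, 340), "median": (281, 316), "range": 1000},
    (6, "ADC"): {"mean": (189, 147), "median": (168, 108), "range": 1000},
    (22, "ADC"): {"mean": (682, 684), "median": (664, 634), "range": 2000},
}


def change_relative_to_range(pre, post, value_range):
    """Absolute change and that change as a percentage of the map's range."""
    delta = abs(post - pre)
    return delta, 100.0 * delta / value_range


changes = {}
for key, row in table.items():
    for metric in ("mean", "median"):
        pre, post = row[metric]
        changes[key + (metric,)] = change_relative_to_range(pre, post, row["range"])

for (patient, map_name, metric), (delta, percent) in changes.items():
    print(f"patient {patient:>2} {map_name:<3} {metric:<6} "
          f"delta = {delta:6.2f} ({percent:.1f}% of range)")
```

Normalising by the range rather than by the pre-treatment value keeps the FF and ADC changes comparable despite their very different scales.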

Finally, from the observation of the plots alone, it appears that patients 2 and 22, classified as having progressive disease and stringent complete response, respectively, show the most distinct plots when the pre-treatment and post-1st cycle signal intensities are compared.


Figure 4.4: Density plot of the signal intensities extracted from the ADC maps of patient 2, who presents a response classification of 6 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st cycle.

Figure 4.5: Density plot of the signal intensities extracted from the FF maps of patient 2, who presents a response classification of 6 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st cycle.

Figure 4.6: Density plot of the signal intensities extracted from the ADC maps of patient 6, who presents a response classification of 4 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st cycle.

Figure 4.8: Density plot of the signal intensities extracted from the ADC maps of patient 22, who presents a response classification of 1 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st cycle.



5 Discussion

MRI is being increasingly used for the initial evaluation of patients with MM, as well as for monitoring the

disease’s progression. This non-invasive method may be used either qualitatively or quantitatively with

promising results. [13] [14]

In this study, the primary goal was to investigate the possibility of using first-order statistics features

extracted from ADC and FF maps as imaging biomarkers in the early prediction of treatment response

in multiple myeloma.

Several reports attest to the reproducibility of the measurement of signal FF [14] [63] and the repeatability of ADC measurements [12]. The differentiation between responders and non-responders in MM treatment is common practice. [12] [13] [14] [63]

In this study, of the 35 (out of 216) imaging features found statistically significant after the Mann-Whitney U-Test, only 10 survived the Benjamini-Hochberg procedure and none survived the more conservative approaches. In the clinical data set, 8 (out of 69) clinical variables were found statistically significant after the Mann-Whitney U-Test and only 1 attribute survived the multiple comparison corrections. The multiple comparison corrections are applied to avoid overoptimistic results: due to the large number of variables evaluated, the possibility that some attributes are found statistically significant only by chance needs to be taken into consideration.
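To make the correction step concrete, the following Python sketch implements the Benjamini-Hochberg step-up procedure next to a Bonferroni correction. Bonferroni is used here only as an assumed example of a "more conservative" approach; the p-values are hypothetical and are not results from this study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis whose rank is at most largest_k.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[index] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Family-wise error control: compare each p-value with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values (not results from this study).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)

print("uncorrected (p <= 0.05):", sum(p <= 0.05 for p in p_values))
print("Benjamini-Hochberg     :", sum(bh))
print("Bonferroni             :", sum(bonf))
```

In this toy example several p-values below 0.05 do not survive either correction, and Bonferroni keeps fewer than Benjamini-Hochberg, which mirrors the pattern reported above.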

Interestingly, all 11 attributes that survived the BH correction belong to scenario 1. This is the only scenario where patients with partial response are classified as responders. Both scenarios 2 and 3 consider the patients with partial response to treatment as belonging to the non-responders group, although scenario 3 excludes from the analysis patients who are classified as having stable or progressive disease.

In particular, all the imaging features that remain statistically significant after the BH correction at a significance level of 0.05 belong to the pre-treatment data set. Considering this particular data set, scenario 1 displays the best overall results when compared with scenario 2. In this comparison, best results are defined as a higher number of attributes found statistically significant and with an AUC value above 0.70. This observation could be an indication that patients with partial response show a signal intensity distribution in the parametric maps before the induction treatment that is closer to the one displayed by patients with very good partial or complete response. After all, when the patients with partial response are transferred to the non-responders group, the overall discriminatory power of the attributes decreases. Still considering the pre-treatment data set, when scenarios 2 and 3 are compared, in other words, when the patients with stable or progressive disease are removed from the analysis and the patients with very good partial or complete response are compared only with the patients with partial response, the results improve, particularly those associated with the attributes extracted from the FF map. This improvement might be an indication that the patients with partial response can be distinguished from the patients with a complete or very good partial response using data acquired before the beginning of treatment, but that this distinction is more plausible in the absence of patients who do not respond to the induction therapy. Any distinction that could be made a priori regarding the patient's expected final response would be a valuable asset in terms of patient care, since it could potentially spare the patient unnecessary debilitating procedures.

In addition, scenarios 2 and 3 show similar results, both better than those of scenario 1, when considering the delta data set. This may indicate that the variation in signal intensity in patients with complete or very good partial response to treatment is distinct from the variation observed in the remaining patients. This observation may support the hypothesis that patients with partial response could be distinguished from patients with complete or very good partial response to treatment, in this particular case after a single round of induction chemotherapy. This kind of differentiation among responders might allow treatment adjustments that would, ideally, optimize the treatment route for each patient, thus also leading to an improvement in patient care.

A fourth scenario that could be interesting to explore in a bigger data set would be the division of the patients into 3 classes: patients with complete or very good partial response (treatment response classification: 1, 2 and 3), patients with partial response (treatment response classification: 4) and patients with no positive response (treatment response classification: 5 and 6).
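The class definitions discussed in this chapter can be summarised as a small mapping from the response classification (1 to 6) to a label. The sketch below is inferred from the descriptions in this Discussion, not taken from the analysis code; the function name, the label strings and the use of scenario number 4 for the proposed three-class split are hypothetical.

```python
def scenario_label(response_class, scenario):
    """Map a response classification (1-6) to a class label; None means excluded."""
    if response_class not in range(1, 7):
        raise ValueError("response class must be between 1 and 6")
    if scenario == 1:
        # Partial response (4) counts as a responder.
        return "responder" if response_class <= 4 else "non-responder"
    if scenario == 2:
        # Partial response (4) counts as a non-responder.
        return "responder" if response_class <= 3 else "non-responder"
    if scenario == 3:
        # Same as scenario 2, but stable or progressive disease (5, 6) is excluded.
        if response_class >= 5:
            return None
        return "responder" if response_class <= 3 else "non-responder"
    if scenario == 4:
        # Proposed three-class split.
        if response_class <= 3:
            return "complete or very good partial response"
        if response_class == 4:
            return "partial response"
        return "no positive response"
    raise ValueError("scenario must be 1, 2, 3 or 4")


# Response classifications of the three patients shown in the density plots.
for patient, response in ((2, 6), (6, 4), (22, 1)):
    labels = [scenario_label(response, s) for s in (1, 2, 3, 4)]
    print(f"patient {patient} (class {response}): {labels}")
```

Only the patient with partial response (classification 4) changes class between scenarios 1 and 2, which is why that group drives the differences discussed above.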

Six attributes from the MR images are considered to best predict the final treatment response according to the p-value evaluation and the ROC curve analysis. All of these key attributes survive the Benjamini-Hochberg correction at a significance level of 0.05 and are associated with an AUC value above 0.850. These attributes are: kurtosis from the pre-treatment ADC map in scenario 1 [P = 0.042, AUC = 0.855 (0.679 − 1.00)], and 90th percentile [P = 0.042, AUC = 0.879 (0.747 − 1)], median [P = 0.042, AUC = 0.856 (0.698 − 1)], root mean squared [P = 0.042, AUC = 0.856 (0.740 − 1)], skewness [P = 0.042, AUC = 0.856 (0.702 − 1)] and total energy [P = 0.042, AUC = 0.864 (0.703 − 1)] from the pre-treatment FF map in scenario 1.
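For a single attribute used directly as a score, the AUC is closely tied to the Mann-Whitney U statistic: AUC = U / (n1 · n2), where n1 and n2 are the class sizes. The sketch below illustrates that relation on hypothetical feature values; it does not reproduce the p-values or the confidence intervals quoted above, and the function names are illustrative.

```python
def mann_whitney_u(group_a, group_b):
    """Count pairs (a, b) with a > b; ties contribute 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def auc_from_u(group_a, group_b):
    """Probability that a random member of group_a scores above one of group_b."""
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))


# Hypothetical values of one imaging feature (not data from this study).
responders = [30, 32, 35, 38, 41]
non_responders = [22, 25, 31]

u = mann_whitney_u(responders, non_responders)
auc = auc_from_u(responders, non_responders)
print(f"U = {u}, AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to no separation between the classes and an AUC of 1 to perfect separation, so the same pairwise comparison underlies both the significance test and the discriminatory power reported for each attribute.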

From the more detailed analysis using the box and whisker plots, the key attributes from the pre-treatment FF map appear to display some of the best separations between classes when compared with the remaining key attribute. This result underlines the possibility of using imaging features extracted before the beginning of treatment as imaging biomarkers, with special focus on the FF maps, with sensitivity above 66% and specificity above 81%. If any of these biomarkers were validated as reliable in a larger cohort of patients, this result would be very promising for patient care. As mentioned above, all the key attributes belong to scenario 1; as a consequence, these key imaging features could allow patients who achieve at least a partial response to be discriminated from patients who are non-responsive to treatment before the treatment has started. Thus, these biomarkers could prevent patients from undergoing unnecessary debilitating procedures.

It was mentioned previously that the data sets display unbalanced class proportions and, as a consequence, the accuracy metric could give misleading results. For this reason, other performance metrics were estimated and, as can be seen in Table 4.10, most of the attributes in scenario 1 display a higher accuracy than precision or F1-measure, with a difference ranging between 19.5 and 26.5 percentage points. This difference may reflect an optimistic estimation of the accuracy due to the considerably lower amount of positively labeled data (approximately 20% positive observations).
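The effect of class imbalance on these metrics can be illustrated with a small confusion matrix. The counts below are hypothetical (25 patients, 5 of them in the positive class) and are not taken from Table 4.10; they are chosen only to show how accuracy can sit well above precision and F1-measure when the positive class is the minority.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (0.0 when undefined)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 5 positives (4 detected), 20 negatives (3 false alarms).
metrics = classification_metrics(tp=4, fp=3, tn=17, fn=1)
for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```

Because the many correctly classified negatives dominate the accuracy, a handful of false positives barely changes it while lowering precision and F1-measure substantially.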

Appendix D.1 displays the attributes that were considered good based on the AUC analysis alone but that did not survive the multiple comparison corrections, which led to their exclusion as possible key attributes. These attributes are found across the three scenarios evaluated. The scenarios have differently balanced data sets, scenario 1 being the most unbalanced and scenario 3 the most balanced, with an almost 50/50 division of patients among classes. One may observe that the performance metrics accuracy, precision and F1-measure get successively closer in value as the data set becomes more balanced. This result supports the initial premise that the performance of an attribute should be evaluated using more than one performance metric.

Giles (2014) analyzes the evolution of the ADC signal intensity in patients with multiple myeloma. The authors suggest that the mean signal intensity from the ADC map increased in responders but not in non-responders over the course of three cycles of chemotherapy. [12] They consider patients with partial response as responders, which corresponds to scenario 1 in this work. The overall increase or stability in the ADC signal intensity in both classes is also reported by Latifoltojar (2016 and 2017). [13] [14]

Latifoltojar (2017) also analyzes the FF signal intensity in patients with multiple myeloma, using whole-body MRI focused on focal lesions. The classes responders and non-responders were defined as in scenario 1. In the class responders, significant changes in the mean signal intensity were found over the course of two cycles of chemotherapy, while in the class non-responders no significant alterations were found. These authors report stronger alterations in the signal FF, which is considered to be a better biomarker than ADC. [14] This conclusion is supported by a previous study by the same authors, where significant increases in signal FF after 8 weeks of treatment show the potential of early signal FF changes in the prognosis of MM. [13] Both studies are focused on focal lesions, as these are recognized to be more relevant to disease pathogenesis and risk assignment than diffuse marrow signal abnormalities. [64]

There are two main differences between this study and the ones cited above. In this work only some vertebrae from the spine were analyzed and the post-1st cycle data was collected after one round of chemotherapy, whereas the studies conducted by Giles (2014) and Latifoltojar (2016 and 2017) used whole-body DWI and analyzed a longer period of treatment.

The studies cited [12] [14] aim to study the overall signal intensity changes as potential biomarkers for MM, focusing on the mean. In this thesis several first-order imaging features were evaluated individually through a univariate analysis. The comparison that can be made between studies is therefore limited. In this work, the mean was not found statistically significant after the multiple comparison corrections in any of the analyses conducted on the evolution of signal intensity (delta). In the pre-treatment data set for the FF map in scenario 1, however, this first-order statistic displays statistical significance after the multiple comparison corrections at the significance level of 0.05 and an AUC value of 0.841. Finally, there are several patterns associated with MM (such as the existence of focal lesions or the predominance of diffuse infiltration), which may condition the water dispersion and, consequently, the intensity of the signals acquired. [62]

In addition to the main goal of this study, possible clinical biomarkers were evaluated, and the pre-treatment and post-1st cycle signal intensities extracted from the ADC and FF maps were compared for patients with different responses to treatment.

Among the clinical variables studied, one is of particular interest: the serum lactate dehydrogenase (LDH), given its very good AUC value [AUC = 0.906 (0.718 − 1)], which is accompanied by good performance metrics. This attribute is associated with a sensitivity of 83.3% and a specificity of 100% at a threshold of 195.5. Nevertheless, this variable does not survive the multiple comparison corrections, which may indicate that the LDH was found statistically significant after the Mann-Whitney U-Test (P = 0.005) only by chance. Several references support the LDH as a valuable attribute in the prediction of disease progression in untreated MM patients. [65] [66] [67] High LDH levels are usually associated with disease proliferation and aggressiveness, and are therefore correlated with a negative outcome regarding response to treatment. [68] This result indicates that the combination of clinical variables and imaging features may be interesting to explore in future work with a bigger cohort of patients.

According to the density plots comparing the two moments analyzed in this study (before and after the first round of chemotherapy), in most cases there appears to be a visible variation in the signal intensity registered in the ADC and FF maps, which is also reflected in the variation of the associated statistical metrics.

The lesions that occur in multiple myeloma are associated with high cellularity and high water content. In the density plots of the FF signal for the different patients, there seems to be a good relation between the variation in signal intensity and the evolution of the disease. The shift of the signal along the x-axis is coherent with the accumulation of water at the lesion sites in MM patients. The increase in water molecules as the lesions evolve should lead to a decrease in the fat fraction at these sites. Therefore, an increase in the overall signal intensity is expected in patients who respond to treatment (the case of patient 22) and a decrease in the overall signal intensity is expected with progressive disease (the case of patient 2).

In addition, upon visualization of the density plots, the variations in ADC signal intensity for patient 2 (an example of a patient with progressive disease) and for patient 22 (an example of a patient with stringent complete response) are coherent with the observations of Messiou (2012), where it is stated that ADC values are higher in marrow with active myeloma than in marrow in remission. [69]

The fat fraction maps might be considered a more reliable source of information regarding disease progression. The quantitative parameter FF results from a simple estimation of the fat fraction within the bone marrow. The temporal changes in the ADC, on the other hand, depend on several variables, such as perfusion, cellularity, fat fraction and water content. The balance among these characteristics may lead to an unpredictable result in an ADC map. For instance, the increase in cellularity associated with multiple myeloma would lead to a decrease of the ADC, whereas the increase in perfusion, also associated with the progression of multiple myeloma, would lead to an increase of the ADC. These conflicting variations may lead to a poor description of the patient's disease progression, especially when inter-patient variability is considered. [14] [70]

There are some limitations to this study, one of the biggest being the small number of patients participating. A reduced number of patients is a common problem in this type of study due to the deteriorated physical condition that the patients have to endure. The treatment they undergo is very aggressive and leaves them fragile and unwilling to participate in additional tests for research purposes. Another limitation is the arbitrary choice of the moment to perform the second round of image acquisition. Although there are reports that document changes in signal intensity, specifically regarding ADC, as early as one month after the beginning of the treatment [69], the moment chosen for the second image acquisition may not be optimal to distinguish responders from non-responders. A third limitation of this study is the segmentation step, since it is a source of variability in the data obtained from the MR images.


6 Conclusions

Some attributes found in the univariate analysis show potential for discriminating between the classes responders and non-responders, with good sensitivity and specificity rates. Five attributes are of special interest since they may be extracted from the FF maps before the induction treatment: the 90th percentile, with a sensitivity of 66.7% and a specificity of 90.9%, and the median, root mean squared, skewness and total energy, with a sensitivity of 83.3% and a specificity of 81.8%. These results are particularly interesting, since they indicate that the final response to induction therapy could be predicted before the treatment starts, which would improve patient care by avoiding unnecessary debilitating procedures.

In addition, the comparison among the three scenarios considered in this study indicates that a distinction between patients with complete or very good partial response and patients with partial response may be possible. This kind of differentiation among responders might allow treatment adjustments that would, ideally, optimize the treatment route for each patient, thus also leading to an improvement in patient care.

Ideally, there would be a substantial increase in the number of patients available, as this condition is

key to allow the development of a multivariate model.

The univariate analysis is used as a proof of concept, in the sense that it shows the individual

potential of each attribute to predict the final response to treatment of MM. Nevertheless, a more complex

and probably more accurate approach could be achieved with the multivariate analysis.

In the multivariate analysis, with a larger cohort of patients, there is the possibility of combining different attributes, which is not possible in this study since it would lead to an overfitted classifier. A first possible step could be combining the features found statistically significant in the univariate analysis, which includes the combination of imaging features from both ADC and FF maps. In addition, one may consider the association of clinical biomarkers, with particular interest in the ones that may be determined using non-invasive techniques.

The first part of the thesis consisted of the extensive creation of segmentation masks for all four types of MRI images collected (T1-weighted, STIR, in- and opposed-phase gradient echo, and DWI), covering all visible vertebrae. Besides the 30 patients that were eligible for this study, the MRI images of another 97 patients were segmented. This work will be used as ground truth in a deep learning algorithm for automatic segmentation that is being developed by the master student Jose Maria Moreira at the Champalimaud Foundation.


Bibliography

[1] N. C. Institute. What is cancer?, February 2015 (accessed March 2019).

https://www.cancer.gov/about-cancer/understanding/what-is-cancer.

[2] M. Roser and H. Ritchie. Cancer. Our World in Data, 2019 (accessed March 2019).

https://ourworldindata.org/cancer.

[3] W. H. Organization. The top 10 causes of death, May 2018 (accessed March 2019).

https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death.

[4] T. A. C. S. medical and editorial content team. What is multiple myeloma? American Cancer So-

ciety, 2018 (accessed March 2019). https://www.cancer.org/cancer/multiple-myeloma/about/what-

is-multiple-myeloma.html.

[5] C. C. Society. The plasma cells, (Accessed March 2019). http://www.cancer.ca/en/cancer-

information/cancer-type/multiple-myeloma/multiple-myeloma/the-plasma-cells/?region=on.

[6] S. B. A. Baur-Melnyk, H. R. Durr, and M. Reiser. Role of MRI for the diagnosis and prognosis of

multiple myeloma. European journal of radiology, 2005. doi:10.1016/j.ejrad.2005.01.017.

[7] H.-O. voor Volwassenen Nederland (HOVON). M-protein in multiple myeloma, 2004 (accessed

March 2019). http://www.hovon.nl/upload/File/Studies AlgStudInfo HovonRichtlijnenDocs/M-

protein%20in%20MM v02jun04.pdf.

[8] J. C. Dutoit and K. L. Verstraete. MRI in multiple myeloma: a pictorial review of diagnostic and post-treatment findings. Insights Imaging, 2016. doi:10.1007/s13244-016-0492-7.

[9] Chemocare. Melphalan, 2002-2019 (accessed March 2019).

http://chemocare.com/chemotherapy/drug-info/Melphalan.aspx.

[10] G. N. Le, J. Bones, M. Coyne, D. Bazou, P. Dowling, P. O’Gorman, and A.-M. Larkin. Current and

future biomarkers for risk-stratification and treatment personalisation in multiple myeloma. Royal

Society of Chemistry, 2018. doi:10.1039/c8mo00193f.

[11] P. Hugo J. W. L. Aerts. The potential of radiomic-based phenotyping in precision medicine: a review. JAMA Oncology, 2016. doi:10.1001/jamaoncol.2016.263.

[12] S. L. Giles, C. Messiou, D. J. Collins, V. A. Morgan, C. J. Simpkin, et al. Whole-body diffusion-weighted MR imaging for assessment of treatment response in myeloma. Radiology, 2014.

[13] A. Latifoltojar, M. Hall-Craggs, N. Rabin, R. Popat, A. Bainbridge, et al. Whole body magnetic resonance imaging in newly diagnosed multiple myeloma: early changes in lesional signal fat fraction predict disease response. British Journal of Haematology, 2016. doi:10.1111/bjh.14401.


[14] A. Latifoltojar, M. Hall-Craggs, A. Bainbridge, N. Rabin, R. Popat, et al. Whole-body MRI quantitative biomarkers are associated significantly with treatment response in patients with newly diagnosed symptomatic multiple myeloma following bortezomib induction. Eur. Radiol., 2017. doi:10.1007/s00330-017-4907-8.

[15] D. Shah, K. Seiter, F. Talavera, and E. C. Besa. Multiple myeloma guidelines. Medscape, 2018

(accessed April 2019). https://emedicine.medscape.com/article/204369-guidelinesg1%20.

[16] P. Moreau, J. S. Miguel, P. Sonneveld, M. V. Mateos, E. Zamagni, and et al. Multiple myeloma

guidelines. ESMO, 2017. doi:10.1093/annonc/mdx096.

[17] R. J. Gillies, P. E. Kinahan, and H. Hricak. Radiomics: Images are more than pictures, they are data. Radiology, 2016.

[18] H. Collins, S. Calvo, K. Greenberg, L. F. Neall, and S. Morrison. Information needs in the precision

medicine era: How genetics home reference can help. Interactive journal of medical research,

2016. doi:10.2196/ijmr.5199.

[19] V. Kumar, Y. Gu, S. Basu, A. Berglund, S. A. Eschrich, and et al. Radiomics: the process and the

challenges. Magnetic Resonance Imaging, 2012.

[20] L. E. Court, X. Fave, D. Mackin, J. Lee, J. Yang, and L. Zhang. Computational resources for

radiomics. Translational Cancer Research, 2016. 10.21037/tcr.2016.06.17.

[21] D. C. Preston. Magnetic resonance imaging (mri) of the brain and spine: Basics, 2006 (accessed

April 2019). http://casemed.case.edu/clerkships/neurology/web%20neurorad/mri%20basics.htm.

[22] D. J. Bell, J. Jones, et al. Larmor frequency. Radiopaedia, 2005-2019 (accessed April 2019). https://radiopaedia.org/articles/larmor-frequency.

[23] E. K. Outwater, R. Blasbaig, E. S. Siegelman, and M. Vala. Detection of lipid in abdominal tis-

sues with opposed-phase gradient-echo images at 1.5 t: Techniques and diagnostic importance.

RadioGraphics., 1998. 18:1465-1480.

[24] M. A. Bernstein, K. F. King, and X. J. Zhou. Handbook of MRI Pulse Sequences. Elsevier Academic Press, 2004. ISBN:978-0120-92861-3.

[25] D. J. Bell, J. Jones, et al. T1 weighted images. Radiopaedia, 2005-2019 (accessed April 2019). https://radiopaedia.org/articles/t1-weighted-image.

[26] A. Murphy, J. Jones, et al. T2 weighted images. Radiopaedia, 2005-2019 (accessed April 2019). https://radiopaedia.org/articles/t2-weighted-image.

[27] R. Sharma, Mohammad, et al. Short tau inversion recovery. Radiopaedia, 2005-2019 (accessed April 2019). https://radiopaedia.org/articles/short-tau-inversion-recovery.

[28] E. Placidi. Magnetic resonance imaging of colonic function. PhD thesis, University of Nottingham,

2011.


[29] A. S. Shetty, A. L. Sipe, M. Zulfiqar, R. Tsai, D. A. Raptis, and et al. In-phase and opposed-phase

imaging: Applications of chemical shift and magnetic susceptibility in the chest and abdomen.

RadioGraphics, 2018 (accessed April 2019). https://doi.org/10.1148/rg.2019180043.

[30] H. J. Shin, H. G. Kim, M.-J. Kim, H. Koh, H. Y. Kim, and et al. Normal range of hepatic fat fraction

on dual- and triple-echo fat quantification MR in children. PLoS ONE, 2015. 10(2):e0117480.

[31] E. O. Stejskal and J. E. Tanner. Spin diffusion measurements: Spin echoes in the presence of a

time-dependent field gradient. The Journal of Chemical Physics, 1965. 42(1):288-292.

[32] J. H. Burdette, D. D. Durden, A. D. Elster, and Y. F. Yen. High b-value diffusion-weighted MRI of

normal brain. J Comput Assist Tomogr, 2001. 25:515-519.

[33] P. B. Kingsley and W. G. Monahan. Selection of the optimum b factor for diffusion-weighted mag-

netic resonance imaging assessment of ischemic stroke. Mag Reson Med, 2004. 51:996-1001.

[34] R. Channel. Difusion weighted imaging - radiology video tutorial (mri), 2015 (accessed April 2019).

https://www.youtube.com/watch?v=YHxi-Juf G0.

[35] J. J. M. van Griethuysen, A. Fedorov, C. Parmar, A. Hosny, N. Aucoin, and et al. Computa-

tional radiomics system to decode the radiographic phenotype, 2017. Cancer Research, 77(21),

e104–e107. https://doi.org/10.1158/0008-5472.CAN-17-0339.

[36] M. G. Bulmer. Principles of Statistics. Dover Books of Mathematics, 1979.

[37] P. H. Westfall. Kurtosis as peakedness, 1905 – 2014. r.i.p. The American statistician, 2014.

68(3):191-195.

[38] M. M. Oken, R. H. Creech, D. C. Tourney, J. Horton, T. E. Davis, and et al. Toxicity and response

criteria of the eastern cooperative oncology group. American Journal of Clinical Oncology, 1982.

5(6), 649-656.

[39] WebMD. Websters’s New WorldTM Medical Dictionary. Wiley Publishing, Inc., Hoboken, New

Jersey, third edition, 2008. ISBN: 978-0-470-18928-3.

[40] Creatinine and creatinine clearance blood tests. WebMD, 2005-2019 (accessed October 2019).

https://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-blood-tests1.

[41] A. Dasgupta and A. Wahed. Clinical chemistry, immunology and laboratory quality control. Science

Direct, 2014. https://doi.org/10.1016/B978-0-12-407821-5.00013-9.

[42] J. E. Masterson and S. D. Schwartz. The enzymatic reaction catalyzed by lactate dehydrogenase

exhibits one dominant reaction path. Chemical physics, 2014. 442(17):132-136.

[43] M. Fraser and C. Haldeman-Englert. Health Encyclopedia, Latic Acid Dehydrogenase (Blood).

University of Rochester Medical Center, Rochester, New York, 2019 (accessed October 2019).

https://www.urmc.rochester.edu/encyclopedia/content.aspx?contenttypeid=167&contentid=lactic

acid dehydrogenase blood.


[44] M. L. Vekaria, B. Rao, and P. Kuriakose. Significance of bone marrow plasma cell percentage

in patients with monoclonal gammopathy of unknown significance developing multiple myeloma.

Blood, 2014. 124(21):5688.

[45] R. K. Loh, S. Vale, and A. McLean-Tooke. Quantitative serum immunoglobulin tests. Australian

Family Physician, 2013. 42(4):195-8.

[46] J. A. Katzmann, R. J. Clark, R. S. Abraham, S. Bryant, J. F. Lymp, et al. Serum reference

intervals and diagnostic ranges for free kappa and free lambda immunoglobulin light chains: relative

sensitivity for detection of monoclonal light chains. Clinical Chemistry, 2002. 48(9):1437-44.

[47] International Myeloma Foundation. International staging system (ISS) and revised ISS (R-ISS).

https://www.myeloma.org/multiple-myeloma/staging-risk-stratification/international-staging-

system-iss-reivised-iss-r-iss (accessed October 2019).

[48] P. R. Greipp, J. S. Miguel, B. G. Durie, J. J. Crowley, B. Barlogie, et al. International staging

system for multiple myeloma. Journal of Clinical Oncology, 2005. 23(15):3412-20.

[49] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, chapter 1. Pearson Education,

2006. ISBN 0-321-42052-7.

[50] T. Dejoie, J. Corre, H. Caillon, P. Moreau, M. Attal, and H. A. Loiseau. Response in multi-

ple myeloma should be assigned according to serum, not urine, free light chain measurements.

Leukemia, 2019. 33:313–318.

[51] B. G. Tabachnick and L. S. Fidell. Using Multivariate Statistics, chapter 1. Pearson, sixth edition,

2006. ISBN 978-0-205-89081-1.

[52] R. Ho. Handbook of Univariate and Multivariate Data Analysis and Interpretation with SPSS. Chap-

man Hall/CRC, 2006. ISBN 978-1-584-88602-0.

[53] M. C. Morais. Notas de apoio da disciplina de Probabilidade e Estatística. Instituto Superior

Técnico, 2012.

[54] C. M. R. Kitchen. Nonparametric versus parametric tests of location in biomedical research. American

Journal of Ophthalmology, 2009. 147(4):571-572.

[55] B. S. Everitt. The Cambridge Dictionary of Statistics. Cambridge University Press, 2002.

[56] D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and

Hall/CRC, second edition, 2000.

[57] J. H. McDonald. Handbook of Biological Statistics, pages 254–260. Sparky House Publishing, third

edition, 2014. http://www.biostathandbook.com/multiplecomparisons.html.

[58] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics,

1979. 6(2):65-70.


[59] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach

to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 1995. 57(1):289-300.

[60] N. Seliya, T. M. Khoshgoftaar, and J. Van Hulse. A study on the relationships of classifier performance

metrics. IEEE Computer Society, 2009. doi:10.1109/ICTAI.2009.25.

[61] T. Fawcett. An introduction to ROC analysis. Elsevier, 2005. doi:10.1016/j.patrec.2005.10.010.

[62] V. Koutoulidis, S. Fontara, E. Terpos, F. Zagouri, D. Matsaridis, et al. Quantitative diffusion-

weighted imaging of the bone marrow: An adjunct tool for the diagnosis of a diffuse MR imaging

pattern in patients with multiple myeloma. Radiology, 2017. 282(2).

[63] M. Maas, E. M. Akkerman, H. W. Venema, J. Stoker, and G. J. D. Heeten. Dixon quantitative

chemical shift MRI for bone marrow evaluation in the lumbar spine: A reproducibility study in healthy

volunteers. Journal of Computer Assisted Tomography, 2001. 25(5):691–697.

[64] S. V. Rajkumar, M. A. Dimopoulos, A. Palumbo, J. Blade, G. Merlini, et al. International

myeloma working group updated criteria for the diagnosis of multiple myeloma. The Lancet Oncology,

2014. doi:10.1016/S1470-2045(14)70442-5.

[65] H. Uskudar Teke, M. Basak, M. Kanbay, et al. Serum level of lactate dehydrogenase is a useful

clinical marker to monitor progressive multiple myeloma diseases: A case report. Turkish Journal

of Hematology, 2014. 31(1):84-87.

[66] M. A. Dimopoulos, B. Barlogie, T. L. Smith, and R. Alexanian. High serum lactate dehydrogenase

level as a marker for drug resistance and short survival in multiple myeloma. Ann. Intern. Med.,

1991. 115(12):931-5.

[67] E. Terpos, E. Katodritou, M. Roussou, A. Pouli, E. Michalis, et al. High serum lactate dehydrogenase

adds prognostic value to the international myeloma staging system even in the era of novel

agents. European Journal of Haematology, 2010. doi:10.1111/j.1600-0609.2010.01466.x.

[68] N. Hatakeyama, M. Daibata, Y. Nemoto, Y. Ohtsuki, and H. Taguchi. Lactate dehydrogenase

production and release in a newly established human myeloma cell line. American Journal of

Hematology, 2001. 66:267–273.

[69] C. Messiou, S. Giles, et al. Assessing response of myeloma bone disease with diffusion-

weighted MRI. The British Journal of Radiology, 2012. 85(1020):e1198-e1203.

[70] J. Hillengass, T. Bauerle, R. Bartl, M. Andrulis, F. McClanahan, et al. Diffusion-weighted imaging

for non-invasive and quantitative monitoring of bone marrow infiltration in patients with monoclonal

plasma cell disease: a comparative study with histology. British Journal of Haematology,

2011. doi:10.1111/j.1365-2141.2011.08658.x.


[71] S. V. Rajkumar, J. L. Harousseau, B. Durie, K. C. Anderson, M. Dimopoulos, et al. Consensus

recommendations for the uniform reporting of clinical trials: report of the international myeloma

workshop consensus panel 1. Blood, 2011. 117:4691-5.

[72] X. Robin, N. Turck, A. Hainard, N. Tiberti, J.-C. Sanchez, et al. Display and analyze ROC curves.

Technical report, 2019. Package 'pROC', https://cran.r-project.org/web/packages/pROC/pROC.pdf.


A Additional Information

A.1 IMWG Guidelines

The IMWG uniform response criteria were derived from the criteria published by the European Group for Blood and Bone Marrow Transplant, the International Bone Marrow Transplant Registry and the American Bone Marrow Transplant Registry, commonly referred to as the Blade criteria or the European Group for Blood and Bone Marrow Transplant criteria [71].

Table A.1: International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part I] [50]

Response: Criteria

Progressive disease: Requires any one or more of the following:

• Increase of 25 % from lowest response value in serum M-protein (absolute increase must be ≥ 0.5 g/dL), and/or urine M-protein (absolute increase must be ≥ 200 mg/24 h).

• Only in patients without measurable serum and urine M-protein levels: the difference between involved and uninvolved free light chain (FLC) levels (absolute increase must be > 10 mg/dL).

• Only in patients without measurable serum and urine M-protein levels and without measurable disease by FLC levels: bone marrow plasma cells (PCs) percentage (absolute percentage must be ≥ 10 %).

• Definite development of new bone lesions or soft tissue plasmacytomas, or definite increase in the size of existing bone lesions or soft tissue plasmacytomas.

• Development of hypercalcemia (corrected serum calcium > 11.5 mg/dL) that can be attributed solely to the PC proliferative disorder.

Immunophenotypic CR: Stringent CR plus absence of phenotypically aberrant PCs (clonal) in BM with a minimum of 1 million total BM cells analyzed by multiparametric flow cytometry (with > 4 colors).

Molecular CR: CR plus negative ASO-PCR, sensitivity 10^-5.


Table A.1: International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part II] [50]

Response: Criteria

Stringent complete response: CR as defined, plus

• Normal FLC ratio, and

• Absence of clonal PCs by immunohistochemistry or 2- to 4-color flow cytometry.

Complete response:

• Negative immunofixation of serum and urine, disappearance of any soft tissue plasmacytomas, and < 5 % PCs in bone marrow.

Very good partial response:

• Serum and urine M-protein detectable by immunofixation but not on electrophoresis, or

• ≥ 90 % reduction in serum M-protein plus urine M-protein < 100 mg/24 h.

Partial response:

• ≥ 50 % reduction of serum M-protein and reduction in 24-hour urinary M-protein by ≥ 90 % or to < 200 mg/24 h.

• If the serum and urine M-protein are not measurable, a decrease ≥ 50 % in the difference between involved and uninvolved FLC levels is required in place of the M-protein criteria.

• If serum and urine M-protein are not measurable, and serum free light assay is also not measurable, ≥ 50 % reduction in bone marrow PCs is required in place of M-protein, provided baseline percentage was ≥ 30 %.

• In addition to the above criteria, if present at baseline, ≥ 50 % reduction in the size of soft tissue plasmacytomas is also required.

Stable disease: Not meeting criteria for CR, VGPR, PR, or PD.


B Data Sets

Table B.1: Response to the induction therapy on a numeric scale: 1, stringent complete response; 2, complete response; 3, very good partial response; 4, partial response; 5, stable disease; and 6, progressive disease. The columns named scenario 1, scenario 2 and scenario 3 explicitly show the class in which each patient is placed, responder or non-responder. In scenario 3, the patients excluded from the study receive the designation excluded.

ID Response to Induction Therapy Scenario 1 Scenario 2 Scenario 3

1 3 responder responder responder

2 6 non-responder non-responder excluded


4 4 responder non-responder non-responder




























C Materials and Methods

C.1 R code

C.1.1 Adjusted p-values

The function below, written in R, returns the adjusted p-values using the function p.adjust. Different multiple comparison corrections can be specified; the ones considered here are the Bonferroni correction ("bonferroni"), the Holm method ("holm") and the BH procedure ("BH"). In the order presented, the corrections are successively less conservative. The adjusted p-values are returned rounded to 6 decimal places.

rm_main = function(data) {

  # Calculation of the adjusted p-value using the Bonferroni correction
  data$pvalue_bonferroni <- round(p.adjust(data$p_value, "bonferroni"), 6)

  # Calculation of the adjusted p-value using the Holm method
  data$pvalue_holm <- round(p.adjust(data$p_value, "holm"), 6)

  # Calculation of the adjusted p-value using the BH procedure
  data$pvalue_BH <- round(p.adjust(data$p_value, "BH"), 6)

  return(data)

}
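The ordering of the three corrections can be checked on a small example. The following is a minimal sketch that uses only base R; the raw p-values are made up for illustration and are not results from this study, and the variable names are hypothetical.

```r
# Hypothetical raw p-values, for illustration only (not study data)
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.200)

# The same three corrections used in rm_main
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")
p_bh <- p.adjust(p_raw, method = "BH")

# Side-by-side view, rounded to 6 decimal places as in rm_main
adjusted <- data.frame(p_raw, p_bonferroni, p_holm, p_bh)
print(round(adjusted, 6))

# Element-wise, Bonferroni >= Holm >= BH, so fewer attributes stay
# significant under the more conservative corrections
stopifnot(all(p_bonferroni >= p_holm), all(p_holm >= p_bh))
```

For every p-value the Bonferroni-adjusted value is at least as large as the Holm-adjusted one, which in turn is at least as large as the BH-adjusted one.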

C.1.2 Generation of the ROC curves

Documentation for the package pROC is available in the CRAN repository [72]. The code is commented to make the functions, loops and conditions easier to follow.

ROC = function(data) {
  library(pROC)   # package with the function "roc"
  par(pty = "s")  # squares the graphic generated
  i <- 2          # column number, skip ID
  z <- colnames(data[, 2:(ncol(data) - 2)])  # vector with the attributes names
  auc <- vector()
  roc.df <- c(1:31)

  while (i <= (ncol(data) - 2)) {  # 2-19 attributes
    # choose positive class based on class proportion
    # levels = c("negative class", "positive class")
    # response is 1 and non-response is 0
    # calculation of the ROC curve parameters for attribute in the position "i"
    # from the "data"
    if (mean(as.vector(unlist(data[, ncol(data)]))) > 0.5) {  # scenario 1
      roc.info <- roc(data[, ncol(data)], data[, i], legacy.axes = TRUE,
                      levels = c(1, 0), plot = TRUE, xlab = "FPR", ylab = "TPR", col = "#377eb8",
                      lwd = 2, print.auc = TRUE, print.auc.x = 0.45, print.auc.y = 0.12)
    } else if (mean(as.vector(unlist(data[, ncol(data)]))) <= 0.5) {  # scenario 2
      roc.info <- roc(data[, ncol(data)], data[, i], legacy.axes = TRUE,
                      levels = c(0, 1), plot = TRUE, xlab = "FPR", ylab = "TPR", col = "#377eb8",
                      lwd = 2, print.auc = TRUE, print.auc.x = 0.45, print.auc.y = 0.12)
    }

    j <- i - 1
    # retrieves AUC for the attribute in the position "i" from the "data"
    auc[j] <- auc(roc.info)
    # prints DeLong confidence intervals referent to the AUC value
    print(ci.auc(roc.info))

    # creation of a data frame with the TPR, FPR and threshold for the attribute in
    # the position "i" from the "data"
    i.df <- data.frame(variable = rep(colnames(data[i])),
                       TPR = roc.info$sensitivities, FPR = (1 - roc.info$specificities),
                       thresholds = roc.info$thresholds, direction = roc.info$direction,
                       stringsAsFactors = FALSE)

    # different attributes have different thresholds, in order to create the data
    # frame with all the attributes, all the individual data frames need to have the
    # same length
    while (nrow(i.df) < 31) {  # filling the remaining rows
      r <- rep("NA")
      i.df <- rbind(i.df, r)
    }

    # junction of the data frame for the attribute in the position "i" from the
    # "data" to the major data frame
    roc.df <- cbind(roc.df, rep("x"), i.df)
    i <- i + 1
  }

  zauc <- cbind(z, auc)  # matrix with the attributes names and respective AUC values
}
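As a cross-check that does not depend on pROC, the AUC of a single attribute can also be computed from ranks, since it equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. The sketch below is an illustration under stated assumptions: the data are made up, the helper name auc_rank is hypothetical, and higher attribute values are assumed to indicate the positive class, whereas the listing above relies on roc to choose the direction.

```r
# Hypothetical data for illustration only: 1 = response, 0 = non-response
response <- c(1, 1, 1, 1, 0, 0, 0)
attribute <- c(3.1, 2.4, 2.9, 1.8, 1.2, 2.0, 0.9)

# Rank-based AUC, assuming higher scores indicate the positive class (label 1)
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r <- rank(c(pos, neg))  # mid-ranks, so ties count as one half
  # Mann-Whitney U statistic of the positive class
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

auc_value <- auc_rank(response, attribute)
print(auc_value)
```

In this toy data set 11 of the 12 positive-negative pairs are ordered correctly, so the function returns 11/12.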


D Results

D.1 ROC Curves Analysis

The ROC curve of every attribute with an AUC above 0.70 was evaluated further. Some attributes were statistically significant in the Mann-Whitney U-test but did not survive the multiple comparison corrections, which indicates that the initial result could have been obtained by chance. These attributes were therefore excluded from further analysis as key attributes.

Table D.1 lists the attributes found, together with the pairs of TPR and FPR considered best for each scenario.

Since each scenario has a different class proportion, comparing the performance metrics between scenarios may not be straightforward. In addition, the positive class changes with the scenario considered, so the initial purpose of maximizing the TPR or minimizing the FPR also depends on the scenario, as explained in Subsection 4.1.1. These issues should be taken into consideration when choosing an attribute to pursue in a more extensive study.

Table D.1: Summary of the attributes found statistically interesting when considering the AUC analysis, with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy (Acc), precision (Pre), F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.

Scenario  Data set       Attribute      AUC    TPR    FPR    Threshold  Acc    Pre    F1
1         ADC PreTreat   Kurtosis       0.855  0.667  0.087  3.30       0.862  0.667  0.667
1         ADC PostTreat  rMAD           0.810  0.800  0.143  111        0.846  0.571  0.667
1         FF PreTreat    90 Percentile  0.879  0.667  0.091  19.0       0.857  0.667  0.667
1         FF PreTreat    Median         0.856  0.833  0.182  11.0       0.821  0.556  0.667
1         FF PreTreat    RMS            0.856  0.833  0.182  17.4       0.821  0.556  0.667
1         FF PreTreat    Skewness       0.856  0.883  0.182  0.624      0.821  0.556  0.667
1         FF PreTreat    Total Energy   0.864  0.883  0.182  2.72×10^7  0.821  0.556  0.667
1         FF PostTreat   Skewness       0.853  0.600  0      0.70       0.917  1.000  0.750
2         ADC Delta      MAD            0.800  0.800  0.125  25.2       0.846  0.800  0.800
2         FF Delta       10 Percentile  0.804  0.900  0.385  -0.50      0.739  0.643  0.750
2         FF Delta       Mean           0.785  0.900  0.385  -0.60      0.739  0.643  0.750
3         ADC Delta      MAD            0.782  0.800  0.091  25.2       0.857  0.889  0.842
3         FF Delta       10 Percentile  0.919  0.875  0.100  -0.5       0.889  0.875  0.875
3         FF Delta       90 Percentile  0.906  1.000  0.200  0.5        0.889  0.800  0.889
3         FF Delta       Mean           0.900  1.000  0.200  -0.122     0.889  0.800  0.889
3         FF Delta       RMS            0.888  1.000  0.200  -0.100     0.889  0.800  0.889
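The accuracy, precision and F1-measure in Table D.1 follow from the chosen TPR and FPR pair once the class sizes are known. The sketch below illustrates the arithmetic for a single operating point. The class sizes (6 positive and 23 negative patients) are an assumption chosen to be consistent with the rounded rates of the first row; they are not stated in the table.

```r
# Assumed class sizes (not stated in Table D.1; chosen to be consistent
# with the rounded TPR and FPR of the first row)
n_pos <- 6   # patients in the positive class
n_neg <- 23  # patients in the negative class

tpr <- 4 / 6   # 0.667 after rounding
fpr <- 2 / 23  # 0.087 after rounding

# Confusion matrix entries implied by the two rates
tp <- tpr * n_pos
fn <- n_pos - tp
fp <- fpr * n_neg
tn <- n_neg - fp

accuracy <- (tp + tn) / (n_pos + n_neg)
precision <- tp / (tp + fp)
f1 <- 2 * precision * tpr / (precision + tpr)

print(round(c(accuracy = accuracy, precision = precision, F1 = f1), 3))
```

Under these assumed class sizes the three values round to 0.862, 0.667 and 0.667, which agrees with the first row of the table.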


D.2 ROC Curves

Figure D.1: ROC curve for the attribute Kurtosis in the ADC pre-treatment data set in scenario 1, with a corresponding AUC value of 0.855 (0.679-1.000).

Figure D.2: ROC curve for the attribute 90 Percentile in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.879 (0.747-1.000).

Figure D.3: ROC curve for the attribute Median in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.698-1.000).

Figure D.4: ROC curve for the attribute Root Mean Squares in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.704-1.000).


Figure D.5: ROC curve for the attribute Skewness in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.702-1.000).

Figure D.6: ROC curve for the attribute Total Energy in the FF post-treatment data set in scenario 1, with a corresponding AUC value of 0.864 (0.703-1.000).


D.3 Box and Whiskers Plots

Table D.2: Statistical parameters concerning the design of the Box and Whiskers plots for the key attributes and clinical variables. The classes are identified as responders and non-responders, corresponding to the patients that are considered to respond and not to respond to treatment, respectively.

Statistical Parameters    Kurtosis (ADC PreTreat 1)        90 Percentile (FF PreTreat 1)
                          Responders   Non-Responders      Responders   Non-Responders
lower extreme             2.12         2.82                21           5
Q1                        2.57         2.96                36           10
median                    2.76         3.48                45           16
Q3                        3.06         3.68                46           32
upper extreme             3.56         4.56                47           40
number of observations    22           6                   22           6
outliers                  10, 13, 20

Statistical Parameters    Median (FF PreTreat 1)           RMS (FF PreTreat 1)
                          Responders   Non-Responders      Responders   Non-Responders
lower extreme             0            0                   6            4.1
Q1                        18           0                   22           6.9
median                    34           2.50                34           9.5
Q3                        38           5.00                36           15.0
upper extreme             41           5.00                40           15.0
number of observations    22           6                   22           6
outliers                  31 31

Statistical Parameters    Skewness (FF PreTreat 1)         Total Energy (FF PreTreat 1)
                          Responders   Non-Responders      Responders   Non-Responders
lower extreme             -1.22        -0.35               5330628      1488935
Q1                        -1.02        0.85                53686528     5280098
median                    -0.58        1.55                166590529    10681660
Q3                        0.09         2.51                197928989    18283058
upper extreme             1.69         3.47                249195096    18283058
number of observations    22           6                   22           6
outliers                  2.20, 2.24 151806015

Statistical Parameters    LDH (Clinical 1)                 Serum IgA (Clinical 1)
                          Responders   Non-Responders      Responders   Non-Responders
lower extreme             100          144                 23           25
Q1                        137          197                 26           26.5
median                    148          227                 29           32.5
Q3                        167.5        272                 86           41
upper extreme             194          277                 171          45
number of observations    23           6                   21           4
outliers                  178, 541, 3180
