Prediction of Treatment Response in Patients with Multiple Myeloma undergoing Chemotherapy using MRI derived
Imaging Biomarkers
Renata Isabel Jónatas Quintino
Thesis to obtain the Master of Science Degree in
Biological Engineering
Supervisor(s): Dr. Nickolas Papanikolaou
Prof. Susana de Almeida Mendes Vinga Martins
Examination Committee
Chairperson: Prof. Maria Margarida Fonseca Rodrigues Diogo
Supervisor: Dr. Nickolas Papanikolaou
Member of the Committee: Dr. Vasilios Koutoulidis
November 2019
The work presented in this thesis was performed at the Computational Clinical Imaging Group of the
Champalimaud Foundation (Lisbon, Portugal), during the period February-October 2019, under the
supervision of Dr. Nickolas Papanikolaou. The thesis was co-supervised at Instituto Superior Tecnico by Prof.
Susana Vinga.
Furthermore, I declare that this document is an original work of my own authorship and that it fulfills
all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
There are three people that I must thank enormously for all their patience throughout my ups
and downs during these challenging times in a student’s life. These three people are always ready to lift
me up and keep me grounded, to listen to all my problems and offer helpful advice or just a much needed
shoulder to lean on. They celebrate all my achievements with such pride and joy, and this one,
the end of an era, is wholeheartedly dedicated to them. To my lovely mom, my wise dad and my goofy
brother.
At the Champalimaud Foundation I had my fair share of helping hands: my thesis advisor Dr. Nickolas
Papanikolaou, PhD Joao Santinha and PhD Jose Moreira, a well deserved thank you. A kind word of thanks
also goes to the rest of the team.
To Professor Ana Azevedo and to the PhD Mariana Ferreira, who with incredible kindness guided
me to Professor Susana Vinga, who kindly agreed to be my thesis advisor. To these three amazing
ladies a big thank you.
To the institutions, namely Escola Secundaria Mouzinho da Silveira and Tecnico Lisboa, that gave
me the tools to come so far along in this journey and to all the professors that made an impact on me and
helped me become the student I am today.
An enormous thank you to an amazing couple that has always been available to help me inside and outside
of the Champalimaud Foundation, Graca and Paulo. Without you I would probably not have ended up
doing this thesis.
To my friends back home, who accompanied me throughout this journey and always gave me a
sense of home even when physically far.
To my friends in Biological Engineering and in life that accompanied me throughout these relentless
and amazing five years of Tecnico, a big thank you. To the ones I share great memories with! Namely,
Carolina Richheimer, Isabel Doutor, Margarida Rodrigues, Pedro Pereira, Sofia Amorim, Simone Gorny
and Tiago Taborda.
To my Palazzo Lombardia friends, whom I will never forget after being a huge part of one of the
most amazing experiences of my life, which made me grow up so much. A big thank you to Ana Bordignon,
Anna Rita Carvalho, Amanda Coelho, Amanda Queirante, Botond Gazda, Carlos Maranghetti, Come
de Tugny, Daniela Oliveira, Gianluigi Quaglia, Jan Witte, Lauren Astruc, Leen Leconte, Marta Lo Presti,
Max Komorek, Sophie Vermeire and Unnie Marie Tvedt.
Last but not least, a gigantic thank you to my big C1 family. Living with so many people is not easy
at times, but all these amazing human beings, with whom I have the pleasure to share a roof, were the
omnipresent support that I am truly thankful for, throughout my struggles and throughout my conquests.
These people contributed to my happiness over five wonderful years; all of you have a very special
place in my heart. A special thank you to Alice Lourenco, Beatriz Filipe, Afonso Luz, Carlos Pires,
Diogo Pires, Goncalo Cardoso, Iara Figueiras, Ines Rainho, Leonardo Pedroso, Joao Nunes, Joao Pedro
Gomes, Maria Mesquita, Matheus Orsi, Miguel Rebocho, Pedro Pereira, Sara Cardoso, Solange Bolas,
Steven Santos and Tiago Costa.
Resumo
Imaging techniques are increasingly being used in the evaluation of multiple myeloma (MM). The
main objective of this work is to use magnetic resonance images to discover an accurate
biomarker that can aid in predicting the response to treatment in patients with MM.
Magnetic resonance images were collected from 30 patients with MM, before and after the
first cycle of chemotherapy treatment. First order statistics that describe the distribution
of signal intensity were extracted from the apparent diffusion coefficient and fat fraction maps
generated from the diffusion weighted images and gradient echo sequences, respectively.
These variables were submitted to a univariate analysis through the Mann-Whitney U-Test to
evaluate statistically significant differences between the two populations of patients, which differ
in response to treatment. These results were submitted to multiple comparison corrections.
In parallel, the ROC (receiver operating characteristic) curves were analysed to identify the
attributes that show the best trade-off between sensitivity and specificity.
Several variables demonstrated good discriminatory power between the two populations, as well
as good values in the performance metrics. The best results are extracted from the maps
collected before treatment, with a sensitivity of 66.7% and a specificity of 90.9%, or
with a sensitivity of 83.3% and a specificity of 81.8%.
This work contributes to establishing the potential of variables collected from magnetic
resonance images to be accurate biomarkers in the prediction of treatment response in MM.
Keywords: magnetic resonance imaging, fat fraction, apparent diffusion coefficient,
multiple myeloma, biomarkers, treatment response.
Abstract
Imaging techniques are being increasingly used in the evaluation of multiple myeloma (MM). The main
objective of this work is to explore magnetic resonance images to discover accurate imaging biomarkers
that can aid in an early prediction of response to treatment in patients with MM.
Magnetic resonance images from the spine of 30 patients with MM were collected, before and after
the first cycle of induction chemotherapy. First order statistics that describe the distribution of signal
intensity were extracted from apparent diffusion coefficient (ADC) and fat fraction (FF) maps, generated
from diffusion weighted imaging and in-phase and opposed-phase gradient echo magnetic resonance images,
respectively.
These imaging features were submitted to a univariate analysis with a Mann-Whitney U-Test to
evaluate statistically significant differences between the two populations of patients, which differ in
response to treatment (responders and non-responders), in three different scenarios. These results were
subsequently submitted to multiple comparison corrections. In parallel, in order to identify the attributes
that displayed the best balance between sensitivity and specificity, the receiver operating characteristic
(ROC) curves were analysed.
Several features demonstrated good discrimination between responders and non-responders, as
well as good performance metrics. The best attributes were extracted from the maps created before the
beginning of induction treatment, with specificity and sensitivity equal to or greater than 81.8% and 66.7%,
respectively.
This work consolidates the potential of imaging features collected from magnetic resonance images as
accurate biomarkers in the prediction of treatment response in MM.
Keywords: magnetic resonance imaging, apparent diffusion coefficient, fat fraction, multiple
myeloma, imaging biomarkers, treatment response.
Contents
Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
List of acronyms and abbreviations
Glossary
1 Introduction
1.1 Motivation
1.2 Topic Overview
1.3 Objectives
1.4 Thesis Outline
2 Theoretical Background
2.1 Image Acquisition
2.1.1 T1-weighted
2.1.2 Short-Time Inversion Recovery
2.1.3 In and Opposed Phase Gradient Fast Field Echo
2.1.4 Diffusion-Weighted Imaging
2.2 Pre-processing
2.3 Segmentation
2.4 Feature Extraction
2.5 Clinical Variables
2.6 Statistical Analysis Concepts
2.6.1 Univariate Analysis
2.6.2 Performance Metrics
3 Materials and Methods
3.1 Patient Population
3.2 MRI Techniques
3.3 Statistical Analysis Implementation
3.3.1 Data sets
3.3.2 Data Preparation
3.3.3 Univariate Analysis
4 Results
4.1 Univariate Analysis
4.1.1 p-value and ROC Curve Evaluation for the First Order Imaging Features
4.1.2 Detailed Analysis of the Key Attributes
4.1.3 Clinical Variables
4.2 Density Plots Comparative Analysis Over the First Round of Chemotherapy
5 Discussion
6 Conclusions
Bibliography
A Additional Informations
A.1 IMWG Guidelines
B Data Sets
C Materials and Methods
C.1 R code
C.1.1 Adjusted p-values
C.1.2 Generation of the ROC curves
D Results
D.1 ROC Curves Analysis
D.2 ROC Curves
D.3 Box and Whiskers Plots
List of Tables
2.1 General form of a confusion matrix.
4.1 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 1.
4.2 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 2.
4.3 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 3.
4.4 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 1.
4.5 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 2.
4.6 p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 3.
4.7 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 1.
4.8 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 2.
4.9 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 3.
4.10 Summary of the attributes considered statistically relevant through the p-value evaluation and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.
4.11 Mean associated with the key attributes for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each key attribute is located.
4.12 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 1.
4.13 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 2.
4.14 p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 3.
4.15 Summary of the attributes considered statistically relevant through the Mann-Whitney U-Test and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC score with the respective 95% confidence interval and p-values before and after the multiple comparison tests.
4.16 Mean associated with the clinical attributes LDH and serum immunoglobulin A (Serum IgA) for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each attribute is located.
4.17 Summary of the statistical metrics that characterize the density plots depicted in Figures 4.4 to 4.9.
A.1 International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part I] [50]
A.1 International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [Part II] [50]
B.1 Response to the induction therapy.
D.1 Summary of the attributes found statistically interesting when considering the AUC analysis.
D.2 Statistical parameters concerning the design of the Box and Whiskers plots for the key attributes and clinical variables. The classes are identified as responders and non-responders corresponding to the class of patients that are considered to respond and not respond to treatment, respectively.
List of Figures
1.1 MRI image (in-phase gradient echo) from a patient’s spine. (A) Original image and (B) segmented image with the regions of interest (spine’s vertebrae) filled with a red label.
3.1 General process used for the collection of the p-values in the univariate analysis.
4.1 Comparison of differences in signal intensity parameters collected from ADC and FF maps between responders and non-responders for the key attributes (A) Kurtosis in the ADC pre-treatment data set in scenario 1 (ADC PreTreat 1), (B) 90 Percentile in the FF pre-treatment data set in scenario 1 (FF PreTreat 1), (C) Median in the FF pre-treatment data set in scenario 1, (D) Skewness in the FF pre-treatment data set in scenario 1, (E) Root Mean Squared in the FF pre-treatment data set in scenario 1 and (F) Total Energy in the FF pre-treatment data set in scenario 1.
4.2 Box and whisker plot for the clinical attributes (A) LDH and (B) Serum Immunoglobulin A (Serum IgA) in scenario 1.
4.3 Box and whisker plot for the clinical attribute Serum Immunoglobulin A (Serum IgA) in scenario 1, amplified.
4.4 Density plot relative to the signal intensities extracted from the ADC maps from patient 2, who presents a response classification of 6 to the induction therapy.
4.5 Density plot relative to the signal intensities extracted from the FF maps from patient 2, who presents a response classification of 6 to the induction therapy.
4.6 Density plot relative to the signal intensities extracted from the ADC maps from patient 6, who presents a response classification of 4 to the induction therapy.
4.7 Density plot relative to the signal intensities extracted from the FF maps from patient 6, who presents a response classification of 4 to the induction therapy.
4.8 Density plot relative to the signal intensities extracted from the ADC maps from patient 22, who presents a response classification of 1 to the induction therapy.
4.9 Density plot relative to the signal intensities extracted from the FF maps from patient 22, who presents a response classification of 1 to the induction therapy.
D.1 ROC curve for the attribute Kurtosis in the ADC pre-treatment data set in scenario 1, with a corresponding AUC value of 0.855 (0.679-1.000).
D.2 ROC curve for the attribute 90 Percentile in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.879 (0.747-1.000).
D.3 ROC curve for the attribute Median in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.698-1.000).
D.4 ROC curve for the attribute Root Mean Squared in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.704-1.000).
D.5 ROC curve for the attribute Skewness in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.702-1.000).
D.6 ROC curve for the attribute Total Energy in the FF post-treatment data set in scenario 1, with a corresponding AUC value of 0.864 (0.703-1.000).
List of acronyms and abbreviations
ROC Receiver Operating Characteristic
α Significance Level
β2M Beta-2 Microglobulin
ADC Apparent Diffusion Coefficient
AUC Area Under the Curve
BH Benjamini-Hochberg
CR Complete Response
DWI Diffusion Weighted Imaging
ECOGPS Eastern Cooperative Oncology Group Performance Status
FDR False Discovery Rate
FFE Fast Field Echo
FF Fat Fraction
FLC Free Light Chain
FN False Negative
FPR False Positive Rate
FP False Positive
FWER Family-Wise Error Rate
GRE Gradient Echo
Hb Hemoglobin
IMWG International Myeloma Working Group
IQR Interquartile Range
MAD Mean Absolute Deviation
MM Multiple Myeloma
MRI Magnetic Resonance Imaging
PCs Plasma Cells
PD Progressive Disease
PMNs Polymorphonuclear leukocytes
PR Partial Response
RF Radio Frequency
RMS Root Mean Squared
ROI Region of Interest
SD Stable Disease
SD Standard Deviation
SE Spin Echo
STIR Short-Time Inversion Recovery
Serum IgA Serum Immunoglobulin A
TE Echo Time
TI Inversion Time
TPR True Positive Rate
TN True Negative
TP True Positive
TR Repetition Time
VGPR Very Good Partial Response
VOI Volume of Interest
WBC White Blood Cell
rMAD Robust Mean Absolute Deviation
1 Introduction
1.1 Motivation
Chemotherapy is a debilitating procedure that patients with multiple myeloma commonly undergo to fight
the cancer. It is not certain that the treatment will have a successful outcome, since it is impossible to
predict exactly how the patient will respond to it. A non-invasive imaging biomarker
that could aid in predicting the treatment’s outcome would indicate whether chemotherapy is the path
to be followed and, if not, other alternatives could be explored without submitting the patient to this
particularly exhausting procedure. If the predictions are accurate and can be made at an early
treatment stage, time and money could be saved and, most importantly, patient care could be improved
by avoiding unnecessary treatment.
1.2 Topic Overview
According to the National Cancer Institute, cancer is a collection of diseases in
which abnormal cells are able to grow indefinitely and may spread into nearby tissues. [1]
The total number of cancer deaths continues to rise due to the increasing population size, life
expectancy and population mean age. [2] In 2016, cancer was the sixth leading cause of death in the
world according to the World Health Organization. This is a consequence of the increasing number
of cancer patients, as well as of the progress made against other deadly diseases, such as
HIV/AIDS or tuberculosis. [3]
Multiple myeloma (MM) is characterized by the proliferation and accumulation of monoclonal plasma
cells. [4] As defined by the Canadian Cancer Society, plasma cells are a type of white blood cell that
secretes large volumes of antibodies. These cells are an important part of the immune system and can be
found in the bone marrow. The abnormal plasma cells, also known as myeloma cells, can form tumours
in the bones. A single tumour formed by myeloma cells is called a plasmacytoma; if several
plasmacytomas are found, the condition is called multiple myeloma. [5] The abnormal plasma cells that
characterize this cancer can be distributed in the bone marrow focally and/or in a diffuse manner.
[6]
The overgrowth of plasma cells may lead to a low blood count, which may result in anemia (a shortage
of red blood cells), thrombocytopenia (a low level of platelets) and leukopenia (a shortage of normal white
blood cells). The main consequences of these conditions are weakness and fatigue, increased bleeding
and bruising and a frail immune response, respectively.
The existence of myeloma cells also interferes with bone tissue replacement. The osteoclasts and
osteoblasts work together in order to produce healthy and strong bones. The myeloma cells induce
the osteoclasts to dissolve the existing bone, without a proper follow-up by the osteoblasts, which are
responsible for the synthesis of new bone. This imbalance leads to the formation of weak bones and to
an increase in the calcium concentration in the bloodstream (hypercalcemia). [4]
As a last remark, the uncontrolled growth of the abnormal plasma cells leads to the dominant
production of a single immunoglobulin, the monoclonal immunoglobulin. This monoclonal immunoglobulin,
commonly called M-protein, may be used to measure the tumour load. Due to the predominance of a
single antibody and to the very inefficient production of others, the body is unable to protect itself against
infections. [7] The kidneys are also commonly affected by the production of M-protein, leading to kidney
damage or failure (renal impairment). [4]
This disease may develop from an asymptomatic premalignant stage (characterized by the
existence of abnormal cells that increase the chance of evolving into cancer), to monoclonal gammopathy
of undetermined significance (characterized by the presence of M-protein), then to smouldering multiple
myeloma (characterized by the detection of abnormal cells in the bone marrow and abnormal protein
in the blood and/or urine in asymptomatic patients), and finally to symptomatic multiple myeloma
with end-organ damage, such as renal impairment, hypercalcemia, anaemia and bone disease, as
mentioned above. [8]
Based on the International Myeloma Working Group (IMWG) diagnostic criteria reported in 2014, the
diagnosis of MM relies on the demonstration of bone marrow plasmacytosis and/or on the presence of
M-proteins in the serum or urine and/or the detection of end-organ damage, with particular attention to
bone lesions. [8]
Although multiple myeloma remains incurable, novel therapies based on drug combinations have
prolonged survival. In the early 1960s, melphalan was introduced as an anti-cancer chemotherapy
drug. This type of drug works by damaging the genetic material of the cell responsible for
cellular division. [9] For patients responsive to melphalan, the treatment resulted in a survival
increase of more than 2 years. In the late 1980s, this treatment was combined with autologous bone
marrow transplantation, which led to an overall survival greater than 3 years. Nowadays, a
3-drug combination consisting of an immunomodulatory drug, a proteasome inhibitor and a
glucocorticoid is commonly used, followed by autologous stem cell transplantation and maintenance
therapy with a low dose of an immunomodulatory drug or of a proteasome inhibitor. The
immunomodulatory drugs are responsible
for pleiotropic anti-myeloma properties including immune-modulation, anti-angiogenic, anti-inflammatory
and anti-proliferative effects. The proteasome inhibitor, by blocking the action of the proteasome, prevents
the degradation of pro-apoptotic factors. The glucocorticoid is part of the feedback mechanism in the
immune system related to exaggerated responses; it can therefore be used to reduce inflammation.
Recently, immunotherapy has gained relevance in myeloma therapy. Nevertheless, the treatments
eventually fail due to acquired drug resistance and clonal evolution. Genetic changes were detected between
the time of diagnosis and relapse, which suggests either the expansion of resistant clones present at low
frequency at diagnosis or the appearance of mutations as the disease progresses. These findings imply
that the best path to follow should be individualized treatment for maximum survival benefit. [10]
Magnetic resonance imaging (MRI) is a non-invasive method that can be adopted to evaluate the
existence of alterations in bone marrow composition. It is used in the diagnostic work-up of patients with MM
and to assess the development of the disease during and after treatment. [11]
The lesions caused by the disease have high cellularity and high water content, in contrast
with the normal bone marrow composition; these lesions therefore present a different signal intensity
from healthy bone marrow. Different MRI techniques are used to form a better informed
opinion about staging and response to treatment, such as conventional spin echo (SE) and
diffusion weighted imaging (DWI). Conventional MRI provides morphological information for
the detection of focal lesions in patients with MM, while DWI provides additional functional information.
Conventional MRI is used to assess infiltration patterns and DWI evaluates the bone marrow composition
and cellularity. [8]
The parametric maps that can be generated from magnetic resonance images have been explored in
order to find a reliable and precise biomarker. Some studies regarding the apparent diffusion coefficient
(ADC) maps have been conducted and they indicate the existence of a relation between the treatment’s
outcome and the evolution in signal intensity during treatment. [12] Furthermore, there are also some
studies that support the use of fat fraction (FF) maps as an excellent biomarker. [13] [14]
Five different patterns of bone marrow infiltration can be identified in multiple myeloma: a normal
appearing marrow, focal infiltration, diffuse disease, salt-and-pepper involvement or combined focal and
diffuse infiltration. The focal lesions are characterized by circumscribed areas with a signal intensity
distinct from that of the surrounding vertebral bodies. The diffuse infiltration is characterized by a
homogeneous shift in signal intensity from the normal appearance of the bone marrow. [6] The combined
focal and diffuse infiltration pattern is characterized by a homogeneous shift in signal intensity with
additional foci interspersed. The salt-and-pepper pattern is characterized by an inhomogeneous shift in
signal intensity. [8] The signal intensity shift in damaged areas depends on the technique used. For
instance, since the abnormal areas show an increase in water content and a decrease in fat content, the
signal intensity decreases on T1-weighted images, while it increases on T2-weighted images with fat
suppression. Both these techniques are explained in section 2.1.
The IMWG, the National Comprehensive Cancer Network and the European Society for Medical
Oncology recommend a skeletal survey study for MM diagnosis. [15] [16] Nevertheless, the images
provided by the University Hospital of Athens, Greece, are only spine surveys focusing on the lumbar
region. Although a whole-body survey would be more precise, this region is thought to be
representative of the alterations occurring in the bone marrow. Access was granted to T1-weighted,
Short-Time Inversion Recovery (STIR), In and Opposed Phase Gradient Fast Field Echo (FFE) and
DWI magnetic resonance images from patients undergoing chemotherapy as the induction treatment
against multiple myeloma.
From the several MRI images, a large number of quantitative imaging features can be collected
and converted into mineable high-dimensional data in order to improve decision support in
precision medicine; this process is known as radiomics. [17] Precision medicine
focuses on understanding individual variability in disease prevention, care and treatment. [18] The
ultimate goal is to establish quantitative predictive or prognostic associations between the clinical images
and the medical outcome. [19]
Once the images associated with the problem under study are available, the workflow in this
radiomics project comprises the following stages:
1. Segment the regions of interest (ROI) in each patient's images;
2. Extract the image features;
3. Perform univariate analysis.
The segmentation may be the most critical and challenging step in the radiomics workflow. The
segmentation masks guide the subsequent step of feature extraction, which ultimately conditions
the validity of the conclusions drawn from the study. Furthermore, there is a high degree of variability
inherent to the semi-automated process used nowadays derived from inter-user expertise and distinct
image acquisition conditions. As a consequence and to ensure reproducibility, at least two radiologists
are needed for the segmentation step, which leads to increased costs. With all of this in consideration,
there is a consensus that the development of a program capable of performing automated segmenta-
tion with minimal manual intervention is crucial to minimize inter-user variability, while increasing time
efficiency, reproducibility and accuracy. [17] [20]
All the vertebral bodies present in the resonance images are considered ROIs. Figure 1.1
presents an image of a patient's spine in the sagittal plane before and after the segmentation
process performed with ITK-SNAP.
Figure 1.1: MRI image (in-phase gradient echo) of a patient's spine. (A) Original image and (B) segmented image with the regions of interest (the vertebrae of the spine) filled with a red label.
The MRI scans are composed of several sagittal images of the spine acquired at different
positions. Each set of ROIs forms the volume of interest (VOI). The imaging features collected are
representative of the VOI.
In the present case, the features extracted refer only to first-order statistics, comprising
18 features that are subsequently subjected to a univariate analysis. The analysis was limited to this
type of feature because parametric quantitative maps (ADC and FF) were used instead of the conventional
originally acquired images, and it was limited to this type of analysis because the small number of
patients does not allow a reliable multivariate analysis. The software chosen to analyze the extracted
imaging features was RapidMiner, with the aid of RStudio.
Radiomics is a fairly young field with great potential to accelerate precision medicine. The
goal of radiomics is to link the image features to phenotypes or molecular signatures. For that reason,
it is necessary to develop an integrated database in which the images and the extracted features are
associated with the clinical and molecular data, while patient privacy is protected through de-identification.
Building this database is still a challenge to be overcome.
1.3 Objectives
The objective of this work is to discover accurate imaging biomarkers that can aid in the
prediction of treatment response status in an early phase of induction chemotherapy in patients with
multiple myeloma.
Is it possible to predict treatment response using either the pre-treatment imaging data or the
post-first-cycle data, in order to optimize therapeutic decisions?
1.4 Thesis Outline
All the techniques, parameters and algorithms that contribute to the development of the models attained
are described in the chapter Theoretical Background (chapter 2). The sections of this chapter
are briefly outlined below.
Section 2.1: Image Acquisition. The MRI images were obtained using several techniques, such as
T1-weighted, STIR, In and Opposed phase gradient echo and DWI. The principles of all these techniques
are described in this section. Furthermore, it is explained how the FF and ADC parametric maps
are built from the In and Opposed phase gradient echo and the DWI resonance images, respectively.
Section 2.2: Pre-processing. The MRI images may need correction in order to increase image
quality. Although no pre-processing took place in this work, this section describes what led to that decision.
Section 2.3: Segmentation. The images were segmented using the ITK-SNAP platform, which
allows a semi-automatic segmentation of the volumes of interest that is further explained
in this section. All the vertebral bodies present in the images were segmented as volumes of interest.
Section 2.4: Feature Extraction. The classifier is based on the features extracted from the
segmented images. Feature extraction was performed with the PyRadiomics software. All
the imaging features used are further explained in this section.
Section 2.5: Clinical Variables. Access was granted to the clinical data of the patients that took
part in this study. Evaluating these variables in addition to the imaging features was considered to be
of added value. This section describes the clinical variables that were additionally
explored.
Section 2.6: Statistical Analysis Concepts. This section sets out the scenarios analyzed
regarding response to treatment and the different resources that were used during the implementation
process, which include the univariate approach followed in this work and the performance metrics that
aided the evaluation of the imaging and clinical features.
The MRI imaging techniques and the implementation of the statistical tools used throughout this work
are explored in the chapter Materials and Methods (chapter 3). This chapter is divided into three main
sections, briefly outlined below.
Section 3.1: Patient Population. Definition of the patient population considering age, gender and
treatment’s response.
Section 3.2: MRI Imaging Techniques. Description of the technical parameters used during the
acquisition of the resonance images.
Section 3.3: Statistical Analysis Implementation. Firstly, this section presents the structure
and composition of the data sets containing the imaging feature values (subsections 3.3.1 and 3.3.2).
Secondly, it explores how the features extracted from the ADC and FF maps were evaluated through
a univariate analysis focused on the Mann-Whitney U-Test and the construction of the ROC curve for each
attribute (subsection 3.3.3), which is the main purpose of this section. The data analysis is conducted
using the RapidMiner software, with the aid of RStudio to create the needed functions that were not available.
This software has several tools that allow a thorough analysis of the data.
The results achieved from the implementation of the univariate analysis are inspected in the
chapter Results (chapter 4). These results are focused solely on the univariate analysis due to the
reduced number of patients: under these conditions, the results produced by a multivariate analysis
would not be reliable, as they would be prone to overfitting. The univariate analysis works as a proof
of concept. The attributes found to be more promising in the univariate analysis are graphically explored
to assess the separation between populations. Finally, a graphical appraisal is also conducted of the
evolution of the signal intensity during the first round of treatment in patients with different treatment
response.
In the chapter Discussion (chapter 5), the results achieved in this work are debated and compared
with the literature.
In the chapter Conclusion (chapter 6), the applicability of the potential biomarkers found is
discussed according to the results obtained. Is it plausible to develop the model with an increased number
of patients based on the preliminary results? What is the next step?
2 Theoretical Background
2.1 Image Acquisition
MRI is based on the magnetization properties of atomic nuclei. First, the protons, which are normally
randomly oriented within the molecules of the tissue under observation, are aligned. This alignment
is then perturbed by the introduction of external radio frequency energy. After the perturbation, the nuclei
return to the equilibrium position, releasing energy. The signal emitted by the tissue under examination
is then converted into an intensity level, which is depicted in the resulting image. Different types of
images are obtained by varying the properties of the radio frequency (RF) pulses,
such as the repetition time (TR, the time interval between successive pulse sequences applied
on the same slice) and the echo time (TE, the time interval between the delivery of the radio frequency pulse
and the reception of the echo signal). [21]
The protons spin around the long axis of the primary magnetic field; this phenomenon is called
precession. The Larmor or precessional frequency in MRI refers to the rate of precession of the magnetic
moment of the proton around the external magnetic field. The frequency of precession is proportional
to the strength of the magnetic field. [22] Each substance will have a distinct Larmor frequency. Due
to this difference in resonance frequencies, the spins of different substances go in and out of phase
with each other as a function of time. When the protons precess together, in other words, when they
have overlapping magnetic moments, they are considered in-phase. When the protons do not precess
together, they are considered out-of-phase. In the particular case where the protons are 180° out-of-phase,
they are considered to be in opposed-phase. The period of this phase cycling is equal to the
inverse of the frequency difference between spins. Each state, in-phase and opposed-phase, occurs once per
cycle. [23]
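As an illustration of the phase cycling described above, the following minimal Python sketch estimates the first opposed-phase and in-phase echo times for water and fat. The field strength of 1.5 T and the fat-water chemical shift of about 3.5 ppm are assumed illustrative values, not parameters reported in this section, and the function names are hypothetical.

```python
# Proton gyromagnetic ratio divided by 2*pi, in MHz per tesla.
GAMMA_BAR_MHZ_PER_T = 42.577
# Assumed fat-water chemical shift in parts per million (illustrative value).
FAT_WATER_SHIFT_PPM = 3.5


def larmor_frequency_hz(b0_tesla):
    """Larmor frequency of the proton, proportional to the field strength."""
    return GAMMA_BAR_MHZ_PER_T * 1e6 * b0_tesla


def phase_cycle_times_ms(b0_tesla, shift_ppm=FAT_WATER_SHIFT_PPM):
    """Return (first opposed-phase time, first in-phase time) in milliseconds."""
    # Frequency difference between water and fat protons, in Hz.
    delta_f = larmor_frequency_hz(b0_tesla) * shift_ppm * 1e-6
    # The period of the phase cycling is the inverse of the frequency difference.
    period_ms = 1000.0 / delta_f
    # Opposed-phase occurs half a period after the spins were last in-phase.
    return period_ms / 2.0, period_ms


opposed_ms, in_ms = phase_cycle_times_ms(1.5)  # assumed 1.5 T scanner
print(f"opposed-phase: {opposed_ms:.2f} ms, in-phase: {in_ms:.2f} ms")
```

Under these assumptions the two echo times fall a little above 2 ms and 4 ms, which is why dual-echo acquisitions can sample both states within a single short repetition.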
Spin echo (SE) is a fundamental pulse sequence in MRI. The sequence is composed of an excitation
pulse and at least one refocusing pulse. Usually the flip angles of the excitation and refocusing pulses
are set to 90° and 180°, respectively. A refocusing RF pulse has the objective of refocusing the spins that
have dephased. The resulting images can be acquired with either a single-echo or a multiple-echo
pulse sequence. The difference lies in the number of RF refocusing pulses applied within each
TR interval after the initial excitation. The main advantage of the SE pulse sequence is the possibility
of combining the TR and TE values in order to create specific contrast weighting, such as T1- and
T2-weighted images. [24]
2.1.1 T1-weighted
Different tissues can be characterized by two distinct relaxation times, T1 and T2. T1 is the longitudinal
relaxation time and it determines the rate at which protons return to equilibrium after excitation, in other
words, it is a measure of the time taken for spinning protons to realign with the magnetic field. [21] [25]
T2 is the transverse relaxation time and it determines the rate at which excited protons go out-of-phase
with each other, in other words, it is a measure of the time taken for spinning protons to lose phase
coherence in the transverse plane. [21][26]
T1-weighted images reflect the difference between the T1 relaxation times of distinct tissues and they
are produced by short TR and TE. In this type of image, fat is bright and water is dark. Since water
accumulates at the lesion sites, these regions appear darker on T1-weighted MRI images.
2.1.2 Short-Time Inversion Recovery
STIR is a fat suppression technique. The sequence begins with a 180° pulse, which inverts
the longitudinal magnetization of all tissues. During the time interval that follows this first pulse,
T1 relaxation occurs, seeking to restore the equilibrium alignment in the positive direction of the field. After a
selected interval, a 90° pulse is applied, which tips the available longitudinal magnetization into the
transverse plane. If at the time of this second pulse the longitudinal magnetization of a tissue is close
to zero, its signal will be equal or very close to zero. The time interval between the two pulses is called
the inversion time (TI), and the value that nulls a given tissue depends on its T1. For fat suppression,
the TI is given by Equation 2.1, for which the fat signal is equal to zero. [27]
$$\mathrm{TI} = \ln(2) \cdot T_{1,\mathrm{fat}} \tag{2.1}$$
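Equation 2.1 follows from the recovery of the longitudinal magnetization after an ideal inversion, Mz(t) = M0 · [1 − 2 · exp(−t/T1)], which crosses zero at t = ln(2) · T1. The short Python sketch below checks this numerically. The T1 value used for fat is an assumed illustrative number, not one reported in this work, and the function names are hypothetical.

```python
import math


def longitudinal_magnetization(t_ms, t1_ms, m0=1.0):
    """Mz at time t after an ideal 180 degree inversion (full recovery between repetitions assumed)."""
    return m0 * (1.0 - 2.0 * math.exp(-t_ms / t1_ms))


def null_inversion_time(t1_ms):
    """Inversion time at which a tissue with the given T1 produces no signal (Equation 2.1)."""
    return math.log(2.0) * t1_ms


T1_FAT_MS = 260.0  # assumed illustrative T1 of fat, in milliseconds
ti_ms = null_inversion_time(T1_FAT_MS)
print(f"TI = {ti_ms:.1f} ms, Mz at TI = {longitudinal_magnetization(ti_ms, T1_FAT_MS):.2e}")
```

A tissue with a longer T1 than fat still has negative longitudinal magnetization at this TI, so it keeps a measurable signal while the fat signal is suppressed.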
2.1.3 In and Opposed Phase Gradient Fast Field Echo
In and Opposed phase gradient echo (GRE) is a type of dual-echo sequence, which means that
two echoes are acquired for each pulse sequence. The fast field echo was the type of GRE pulse sequence used
to obtain the in- and opposed-phase MR images. The FFE is primarily used for anatomical imaging. [28]
Water and fat protons precess at slightly different frequencies due to their different local chemical
environment. Unlike SE, GRE pulse sequences do not have an RF refocusing pulse, which means
that the water and fat protons will periodically be in-phase (overlapping magnetic moments) and in
opposed-phase (opposed magnetic moments). These two states correspond to different TE values. In the first
situation the signal intensities add, leading to an increase in signal intensity. In the second
situation the signal intensities subtract, leading to a decrease in signal intensity. The main
application of this sequence is the determination of the fat fraction within each individual voxel. [29]
The signal intensity of each voxel is determined in both in-phase (IP) and opposed-phase (OP)
images. The fat fraction (FF) from the vertebrae is then calculated according to Equation 2.2. [30]
$$\mathrm{FF} = \frac{\mathrm{IP} - \mathrm{OP}}{2 \cdot \mathrm{IP}} \tag{2.2}$$
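Equation 2.2 can be read as follows: if the in-phase signal is the sum of the water and fat contributions (IP = W + F) and the opposed-phase signal is their difference (OP = W − F, for water-dominant voxels), then IP − OP = 2F and the ratio gives F / (W + F). The sketch below applies the equation voxel by voxel to a small hypothetical slice, using plain Python lists instead of an imaging library. The handling of voxels with zero in-phase signal is an assumption of this sketch, not a choice documented in this work.

```python
def fat_fraction(ip, op):
    """Fat fraction of one voxel from its in-phase and opposed-phase signals (Equation 2.2)."""
    if ip == 0:
        # Assumption: voxels without in-phase signal (background) are mapped to 0.
        return 0.0
    return (ip - op) / (2.0 * ip)


def fat_fraction_map(ip_image, op_image):
    """Apply Equation 2.2 to every voxel of two equally sized 2D images."""
    return [
        [fat_fraction(ip, op) for ip, op in zip(ip_row, op_row)]
        for ip_row, op_row in zip(ip_image, op_image)
    ]


# Hypothetical 2x2 slices of in-phase and opposed-phase signal intensities.
ip_slice = [[100.0, 80.0], [0.0, 50.0]]
op_slice = [[60.0, 80.0], [0.0, 10.0]]
ff_slice = fat_fraction_map(ip_slice, op_slice)
print(ff_slice)
```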
2.1.4 Diffusion-Weighted Imaging
DWI is a technique very sensitive to cell density, relative content of fat and marrow cells, water content
and bone marrow perfusion. Thus, it is commonly used to measure bone marrow composition and
cellularity.
The signal intensity of DWI is dependent on the stochastic Brownian motion of water molecules within
a tissue at the microscopic level and on the diffusion gradient strength used. The factor that reflects the
strength and timing of the gradients used to generate these images is called b-value. This value is a
function of the strength, duration and time interval between two strong gradient pulses generated during
the DWI pulse sequence. The increment of any of these variables leads to an increase of the b-value.
[24] [31]
The objective behind the variation of the b-value during DWI image acquisition is the construction of
the apparent diffusion coefficient (ADC, mm²/s) map. The ADC is a direct indicator of water motion
within the extracellular and intracellular space, thus it can be directly related to tissue cell density. This
value can be calculated from the exponential decay of the signal intensity (S) as a function of the b-value
(b, s/mm²), as shown by Equation 2.3. [31][32][33]
$$S = S_0 \, e^{-b \cdot \mathrm{ADC}} \tag{2.3}$$
In this equation, $S_0$ corresponds to the signal intensity when the b-value is equal to zero. [34]
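Taking the logarithm of Equation 2.3 gives ln(S) = ln(S0) − b · ADC, so the ADC of a voxel can be estimated as the negative slope of a straight line fitted to ln(S) against b. The sketch below is one possible way of doing this with an ordinary least-squares fit. It is not necessarily the fitting procedure used by the scanner software, and the b-values and signal intensities are hypothetical.

```python
import math


def fit_adc(b_values, signals):
    """Log-linear least-squares fit of S = S0 * exp(-b * ADC) for a single voxel."""
    log_signals = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_y = sum(log_signals) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(b_values, log_signals))
    slope = sxy / sxx
    adc = -slope                              # ADC is the negative slope
    s0 = math.exp(mean_y - slope * mean_b)    # the intercept gives ln(S0)
    return adc, s0


# Hypothetical acquisition with five b-values (s/mm^2) and noise-free signals.
b_values = [0.0, 50.0, 150.0, 500.0, 800.0]
true_adc, true_s0 = 1.0e-3, 1000.0            # assumed values, mm^2/s and arbitrary units
signals = [true_s0 * math.exp(-b * true_adc) for b in b_values]

adc, s0 = fit_adc(b_values, signals)
print(f"ADC = {adc:.2e} mm^2/s, S0 = {s0:.1f}")
```

With noisy data, the signal at high b-values approaches the noise floor and biases the logarithm, which is one reason why the choice of b-values matters.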
Lesion sites display higher signal intensity in DWI images, due to a low amount of fat cells and a
high retention of water molecules, as a consequence of a higher cell density that restricts the water
molecules’ diffusion.
2.2 Pre-processing
There was no need for pre-processing. For both the ADC maps and the In and Opposed phase gradient echo
images, five and two 3D volumes, respectively, are transformed through the relationship between
voxels at the same position.
Noise removal is a common procedure. Nevertheless, it could erase information
contained in the images, for instance at higher b-values, where the signal intensity is lower. The MR signal is
usually relative, with large differences between scanners and vendors. Normalizing the image before
feature calculation may reduce this confounding effect. However, if only one specific scanner
is used or if the images reflect an absolute value (e.g. ADC maps or T2 maps, as opposed to T2-weighted
images), normalization is optional.
2.3 Segmentation
ITK-SNAP was the tool used to identify the ROIs on the patients' spines during
this work. This software allows the consecutive analysis of a set of images acquired along the
sagittal plane of the spine. It provides a segmentation tool based on user input, which comprises a
presegmentation stage that roughly delimits regions based on signal intensity, followed by the
identification of the regions of interest within the delimited areas and, finally, a contour evolution that
attempts to fill the areas previously selected. All these steps require user input. To improve the
semi-automatic segmentation achieved, the images can be further refined with a coloring tool. The extent
of this procedure is highly dependent on image quality.
Segmentation masks were constructed during this work by the author of this thesis for all four types
of MRI images (T1-weighted, STIR, In and Opposed phase gradient echo and DWI), comprising all
visible vertebrae. Nevertheless, only two masks are used in the feature extraction step: DWI and In
and Opposed phase gradient echo, as the objective is to develop a classifier based on the ADC and FF
maps.
2.4 Feature Extraction
In order to develop an imaging biomarker from the imaging features, the Python package PyRadiomics
was used. In this thesis, first-order statistics features were
extracted from the ADC and FF parametric maps obtained from the original DWI and In and Opposed
phase dual FFE resonance images, respectively. These features describe the distribution of voxel
intensity within the mask region.
As defined in the PyRadiomics documentation, "First-order statistics describe the
distribution of voxel intensities within the image region defined by the mask through commonly used and
basic metrics." [35]
The metrics used are enumerated and briefly described below. [35] Although 19 features
are presented, only 18 were evaluated in the subsequent work: the Variance and the Standard Deviation
are highly correlated, therefore only the Variance is considered.
In the equations displayed for some of the metrics, $X$ stands for the set of $N_p$ voxels included in
the VOI; $P(i)$ denotes the first-order histogram with $N_g$ discrete intensity levels, where $N_g$ is the number
of non-zero bins for a pre-defined bin width; and $p(i)$ represents the normalized first-order histogram,
$P(i)/N_p$. The bin width is usually set so as to obtain a representative number of bins, between 30 and
130, without compromising the bin width.
The array shift ($c$) is an optional parameter that ensures that the voxels with the lowest gray values,
rather than those with values closest to zero, contribute the least to the metric in question. It is
commonly used with normalized data.
1. "Energy is a measure of the magnitude of voxel values in an image." It is the sum of each squared
voxel intensity, as shown in Equation 2.4, where $c$ stands for an array shift.
$$\text{energy} = \sum_{i=1}^{N_p} \left[X(i) + c\right]^2 \tag{2.4}$$
2. "Total Energy is the value of Energy feature scaled by the volume of the voxel in cubic mm." It
reflects the energy affected by the voxel volume, as shown in Equation 2.5.
$$\text{total energy} = V_{voxel} \sum_{i=1}^{N_p} \left[X(i) + c\right]^2 \tag{2.5}$$
3. "Entropy specifies the uncertainty/randomness in the image values. It measures the average
amount of information required to encode the image values." This metric is expressed by Equation
2.6, where $\varepsilon$ is an arbitrarily small positive number ($\approx 2.2 \times 10^{-16}$).
$$\text{entropy} = -\sum_{i=1}^{N_g} p(i) \log_2 \left[p(i) + \varepsilon\right] \tag{2.6}$$
4. Minimum is the lowest gray level intensity value found within the VOI (Equation 2.7).
$$\text{minimum} = \min(X) \tag{2.7}$$
5. 10th Percentile represents the value below which 10% of the observations fall.
6. 90th Percentile represents the value below which 90% of the observations fall.
7. Maximum is the highest gray level intensity value found within the VOI (Equation 2.8).
$$\text{maximum} = \max(X) \tag{2.8}$$
8. Mean expresses the average gray level intensity within the VOI (Equation 2.9).
$$\text{mean} = \frac{1}{N_p} \sum_{i=1}^{N_p} X(i) \tag{2.9}$$
The mean is commonly represented by µ in statistics.
9. Median is the median gray level intensity within the VOI.
10. Interquartile Range (IQR) expresses the spread of the gray intensity values between the 25th
($P_{25}$) and 75th ($P_{75}$) percentiles, as represented in Equation 2.10. It covers the middle 50% of the
gray intensity values in the distribution.
$$\text{interquartile range} = P_{75} - P_{25} \tag{2.10}$$
11. Range represents the distance between the minimum and maximum gray intensity values, as
shown in Equation 2.11. It is the range of gray values found within the VOI.
$$\text{range} = \max(X) - \min(X) \tag{2.11}$$
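12. "Mean Absolute Deviation (MAD) is the mean distance of all intensity values from the Mean value
of the image array." This metric is reflected in Equation 2.12, where $\bar{X}$ denotes the mean gray
level intensity.
$$\text{MAD} = \frac{1}{N_p} \sum_{i=1}^{N_p} \left|X(i) - \bar{X}\right| \tag{2.12}$$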
13. "Robust Mean Absolute Deviation (rMAD) is the mean distance of all intensity values from the
Mean value calculated on the subset of image array with gray levels in between, or equal to the
10th and 90th percentile." This metric is reflected in Equation 2.13, where $N_{10-90}$, $X_{10-90}(i)$ and
$\bar{X}_{10-90}$ are the number of voxels, the gray intensity level of each voxel and the mean intensity
value, respectively, between the 10th and 90th percentiles.
$$\text{rMAD} = \frac{1}{N_{10-90}} \sum_{i=1}^{N_{10-90}} \left|X_{10-90}(i) - \bar{X}_{10-90}\right| \tag{2.13}$$
14. "Root Mean Squared (RMS) is the square-root of the mean of all the squared intensity values." As
well as the metric Energy, it is a measure of the magnitude of the image values and it is reflected
in Equation 2.14.
$$\text{RMS} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) + c\right]^2} \tag{2.14}$$
15. "Standard Deviation (SD) measures the amount of variation or dispersion from the Mean value."
It is by definition the square root of the variance and it is given by Equation 2.15, where $\bar{X}$
denotes the mean gray level intensity.
$$\text{SD} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^2} \tag{2.15}$$
The standard deviation is commonly represented by σ in statistics.
16. "Skewness measures the asymmetry of the distribution of values about the Mean value." A negative
skew is related to a longer left tail, while a positive skew is related to a longer right tail.
In either case, the asymmetry is related to an uneven distribution of the values about the mean,
more concentrated to the right or to the left, respectively. Equation 2.16 translates this measure,
where $\sigma$ represents the standard deviation and $\mu_3$ stands for the third central moment.
$$\text{skewness} = \frac{\mu_3}{\sigma^3} = \frac{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^3}{\left(\sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^2}\right)^3} \tag{2.16}$$
If this variable takes values between -0.5 and 0.5, the distribution of the data is considered
fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the distribution
is considered moderately skewed. Finally, if the skewness is lower than -1 or greater than 1, the
distribution is considered highly skewed. [36]
17. Kurtosis is a measure of the curvature of the probability distribution of values in the image VOI.
"A higher kurtosis implies that the mass of the distribution is concentrated towards the tail(s) rather
than towards the mean. A lower kurtosis implies the reverse: that the mass of the distribution is
concentrated towards a spike near the Mean value." Equation 2.17 translates this measure, where
$\sigma$ represents the standard deviation and $\mu_4$ stands for the fourth central moment.
$$\text{kurtosis} = \frac{\mu_4}{\sigma^4} = \frac{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^4}{\left(\frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^2\right)^2} \tag{2.17}$$
Peter Westfall states that the kurtosis reflects the tailedness of the distribution, in other words, how
the values are stretched. The formula used above is referred to as kurtosis (Equation 2.17). Some
authors may use the "excess kurtosis", which corresponds to kurtosis − 3. A kurtosis equal to 3
or an "excess kurtosis" equal to zero corresponds to a normally distributed population. Compared
to the normal distribution, if the kurtosis is lower than 3, the tails are shorter and thinner, and if
the value is greater than 3, the tails are longer and broader. [37]
18. "Variance is the mean of the squared distances of each intensity value from the Mean value. This
is a measure of the spread of the distribution about the mean." It is by definition the square of the
standard deviation and it is given by Equation 2.18, where $\bar{X}$ denotes the mean gray level intensity.
$$\text{variance} = \frac{1}{N_p} \sum_{i=1}^{N_p} \left[X(i) - \bar{X}\right]^2 \tag{2.18}$$
19. ”Uniformity is a measure of the sum of the squares of each intensity value. This is a measure of
the homogeneity of the image array, where a greater uniformity implies a greater homogeneity or
a smaller range of discrete intensity values.” This metric is reflected in Equation 2.19.
$$\text{uniformity} = \sum_{i=1}^{N_g} p(i)^2 \tag{2.19}$$
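To make the definitions above concrete, the following sketch computes a subset of the listed metrics directly from a flat list of voxel values, using only the Python standard library. It is an illustrative re-implementation of the formulas, not the PyRadiomics code used in this work. The function name, the default bin width and the way bins are counted from the minimum value are assumptions of this sketch.

```python
import math
from collections import Counter


def first_order_features(values, bin_width=1.0, shift=0.0):
    """Compute a subset of the first-order features for a flat list of VOI voxel values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    sd = math.sqrt(variance)
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment

    # Assumed binning: fixed-width bins counted from the minimum value.
    lowest = min(values)
    counts = Counter(int((x - lowest) // bin_width) for x in values)
    p = [count / n for count in counts.values()]    # normalized histogram p(i)
    eps = 2.2e-16

    return {
        "energy": sum((x + shift) ** 2 for x in values),
        "entropy": -sum(pi * math.log2(pi + eps) for pi in p),
        "mean": mean,
        "range": max(values) - lowest,
        "mad": sum(abs(x - mean) for x in values) / n,
        "rms": math.sqrt(sum((x + shift) ** 2 for x in values) / n),
        "variance": variance,
        # A flat region has no spread; returning 0 there is an assumption of this sketch.
        "skewness": m3 / sd ** 3 if sd > 0 else 0.0,
        "kurtosis": m4 / variance ** 2 if sd > 0 else 0.0,
        "uniformity": sum(pi ** 2 for pi in p),
    }


# Hypothetical VOI containing five voxels.
features = first_order_features([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

For this symmetric toy example the skewness is zero and the kurtosis is below 3, in line with the interpretation given for those two metrics.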
2.5 Clinical Variables
Although the main goal of this study is the discovery of accurate imaging biomarkers, the
available clinical data from the patients participating in this study were also explored to understand
if these variables could also aid in the prediction of treatment response in multiple myeloma.
The 23 variables explored are briefly described below.
1. Age at Rx. Age of the patients at the time of diagnosis.
2. Weight. Weight of the patients in the studies.
3. Height. Height of the patients in the studies.
4. Gender. Gender (male or female) of the patients in the studies.
5. Eastern Cooperative Oncology Group Performance Status (ECOGPS). It is a scale whose
underlying objective is to describe the patient's level of functioning. The scale helps to define the
population of patients in a trial studying new treatment methods, and it is a way to track changes
in a patient's level of functioning during treatment. [38]
6. Hemoglobin (Hb). It is present in red blood cells as the oxygen-carrying protein. [39]
7. White Blood Cell (WBC). Blood cell responsible for immune response. [39] Some white cells are
responsible for the production of antibodies.
8. Polymorphonuclear leukocytes (PMNs). These are white blood cells characterized by the
presence of granules in their cytoplasm. These granules contain enzymes with broad-based activity
that digest microorganisms and are released during the innate immune response. [39]
9. Platelet. Irregular disc-shaped blood component that assists in blood clotting. [39]
10. Albumin. The main protein in human blood, which sustains a key role in regulating the blood
osmotic pressure. [39]
11. Creatinine. It is a chemical waste product resulting from the normal muscle metabolism. This
substance is usually excreted through the urine, therefore it can be an indicator of kidney function.
[40]
12. Urea. This substance is normally removed from the blood stream by the kidneys and excreted in
the urine. Excessive urea in the blood may indicate kidney damage. [39]
13. Calcium. Mineral mainly found in the bones, where it is stored. It is essential for healthy bones.
[39] An excess of calcium in the blood stream is called hypercalcemia and it is a common
complication of multiple myeloma.
14. Alkaline Phosphatase. Enzyme that liberates phosphate under alkaline conditions. High levels
of this enzyme may be an indication of bone disease. [39]
15. Beta-2 Microglobulin (β2M ). This protein is a component of the major histocompatibility complex
class I molecules. It is used as a tumor marker. [41]
16. Lactate Dehydrogenase (LDH). It is an enzyme found in nearly all living cells (animals, plants,
and prokaryotes), which is responsible for the interconversion of lactate and
NAD+ to pyruvate and NADH. [42] It is used as a tumor marker. [39] An increased
amount of LDH indicates possible tissue damage. [43]
17. Actual Percentage. Percentage of marrow plasma cells in the bone marrow. Plasma cells are
white blood cells generated in the bone marrow, which secrete antibodies. The population of
plasma cells present in the marrow may be an indicator of disease progression and tumor load.
[44]
18. Serum Peak. Serum is the clear liquid that remains after the blood has clotted; unlike
plasma, it does not contain the clotting factors. [39] The serum peak can measure an abnormal
amount of proteins present in the blood, such as the M-protein in MM.
19. Serum Immunoglobulin A (Serum IgA). Immunoglobulin A is an antibody. Abnormal levels of
Immunoglobulin A may aid the diagnosis of MM. [45]
20. Serum Kappa Free. Light chains are one of the two components of the antibodies. One of the two
types of light chains present in humans is denominated kappa (κ). The detection of monoclonal
free light chains is an indicator of monoclonal gammopathies. [46]
21. Serum Lambda Free. In addition to the kappa light chain, humans have the lambda (λ) light
chain. As mentioned previously, the detection of monoclonal free light chains is an indicator of
monoclonal gammopathies. [46]
22. Ratio κλ. Ratio between the Kappa and Lambda free light chains.
23. International Staging System (ISS). The ISS predicts the severity of multiple myeloma based
on easily obtained protein concentrations, such as β2M and albumin concentration. The patient
is classified as having stage I, II or III MM. The increasing stage number is an indicator of the
disease’s progression. [47] [48]
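As an illustration of how the ISS combines the two protein concentrations, the sketch below assigns a stage from β2M and albumin. The thresholds are those commonly quoted for the original ISS and are not stated in this section, so they should be read as an assumption of the example, as should the units (mg/L for β2M, g/dL for albumin) and the function name.

```python
def iss_stage(beta2m_mg_per_l, albumin_g_per_dl):
    """Return the ISS stage (1, 2 or 3) from serum beta-2 microglobulin and albumin.

    Thresholds assumed from the commonly quoted ISS definition:
    stage III if beta2M >= 5.5 mg/L; stage I if beta2M < 3.5 mg/L and
    albumin >= 3.5 g/dL; stage II otherwise.
    """
    if beta2m_mg_per_l >= 5.5:
        return 3
    if beta2m_mg_per_l < 3.5 and albumin_g_per_dl >= 3.5:
        return 1
    return 2


# Hypothetical patients: (beta2M in mg/L, albumin in g/dL).
for beta2m, albumin in [(2.0, 4.0), (2.0, 3.0), (4.0, 4.0), (6.0, 4.0)]:
    print(beta2m, albumin, "-> stage", iss_stage(beta2m, albumin))
```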
2.6 Statistical Analysis Concepts
Data mining is a growing field that makes use of several machine learning algorithms. Its objective
is to discover novel and useful patterns that might otherwise remain unknown in big data
sets, as well as to predict the outcome of a future observation. The data set used in this thesis is rather
small; nevertheless, the principles used are the same. [49]
The imaging features extracted from the MR images are evaluated for their predictive value
in the classification of a patient as responsive or non-responsive to chemotherapy. Although only two
populations are considered, responders and non-responders, the response to induction therapy is
classified on a scale ranging from 1 to 6, based on the IMWG response criteria [50]:
1. Stringent Complete Response
2. Complete Response (CR)
3. Very Good Partial Response (VGPR)
4. Partial Response (PR)
5. Stable Disease (SD)
6. Progressive Disease (PD)
There are two more categories that are acknowledged by the IMWG but they were not considered in
the response classification: immunophenotypic complete response and molecular complete response.
The IMWG criteria are provided in Appendix A.1 for informative purposes only.
The patients are distributed among two classes, responders and non-responders, considering their
final response to treatment. Three possible approaches are examined regarding the patients’ distribution
among the two classes. Two scenarios (scenarios 1 and 2) were created due to the lack of
consensus on whether a partial response to treatment should be considered an effective response to
treatment. A third situation (scenario 3) is considered, where the intention is to separate patients
that have a complete or very good partial response from the ones that only display partial response to the
induction therapy. In this case, the patients that do not respond to treatment are excluded. This scenario
arises from the elevated number of patients that respond to treatment, but that may need treatment
adjustments according to the kind of response they present. All scenarios are described below.
• Scenario 1: the partial response (4) is treated as an effective response to treatment. Responders:
patients showing a minimum of partial response, classified as 1, 2, 3 and 4. Non-responders:
patients that display stable or progressive disease, classified as 5 and 6.
• Scenario 2: the partial response (4) is not treated as an effective response to treatment. Re-
sponders: patients showing complete or very good partial response, classified as 1, 2 and 3.
Non-responders: patients that show partial response or display stable or progressive disease,
classified as 4, 5 and 6.
• Scenario 3: only patients with a minimum of partial response are evaluated and the patients with
partial response are treated as a separate group. Responders: patients showing complete or
very good partial response, classified as 1, 2 and 3. Non-responders: patients that show partial
response to treatment, classified as 4.
These designations (scenario 1, scenario 2 and scenario 3) will be used throughout this work.
Appendix B.1 depicts the response after the induction therapy on the scale previously
presented (1 to 6) for all three scenarios.
2.6.1 Univariate Analysis
A univariate analysis consists in the review of the effect of one independent variable on a single dependent
variable (response or outcome variable). [51] [52]
Statistical Hypothesis Testing
In statistical hypothesis testing, two hypotheses are confronted regarding an unknown parameter from a
known distribution of a random variable of interest. The outcome of the test will dictate whether a defined null
hypothesis is rejected when confronted with an alternative hypothesis. [53]
The decision made regarding the rejection or non-rejection of the null hypothesis is derived from a
statistical test based on the information collected from a sample. Nevertheless, there are always two
types of errors that may happen: type I error - the null hypothesis is rejected although it is true and type
II error - the null hypothesis is not rejected although it is false. The highest significance level (α) that
does not lead to the rejection of the null hypothesis is denominated p-value. [53]
One can say that a test has statistical significance if the p-value is lower than the level of significance
defined in the study.
A widely used statistical test that may be applied to assess the separation of two populations is the
t-Test which compares the means of two independent groups. [54] This type of parametric test assumes
that the two populations have normal distributions, which is not guaranteed due to the small sample size.
Parametric tests have a greater statistical power, in other words, there is a higher probability that the
test correctly rejects the null hypothesis, avoiding type II errors. [55] Therefore, it is of general agreement
that the adequate parametric test should be used to evaluate data if there is no reason to believe that
its assumptions are being violated. [56] Nevertheless, nonparametric tests have been reported as a
satisfactory alternative in biomedical sciences, especially in small samples. [54] Nonparametric tests do
not make any specific assumptions regarding the population parameters that characterize the underlying
distribution, unlike the parametric tests, therefore they are the best suited for this study. [56] Furthermore,
these tests use the median as the location measure, instead of the mean, as it presents a lower variation
in skewed data and in the presence of outliers. [54]
The Mann-Whitney U-Test evaluates if two independent samples represent two populations with
different median values. [56] The null hypothesis states that both samples come from the same popula-
tion. If the Mann-Whitney U-Test is found statistically significant, one can conclude that there is a high
likelihood that the samples represent populations with different median values.
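As an illustration of the test described above, the following is a minimal Python sketch of the Mann-Whitney U statistic. It is not the RapidMiner operator used in this work: the function name `mann_whitney_u` is hypothetical, and the two-sided p-value relies on the normal approximation with tie and continuity corrections, which is only a rough guide for samples as small as the ones considered here (an exact test would be preferable).

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) for two independent samples.

    The p-value uses the normal approximation with tie and continuity
    corrections; it is an approximation, not an exact small-sample test.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2

    # Assign 1-based ranks, giving tied values their average rank.
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # Rank sum of the first sample and the corresponding U statistic.
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u, min(p_value, 1.0)


# Made-up values for two fully separated groups (illustration only).
u_stat, p_val = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

When the two groups do not overlap at all, U is 0, which is the strongest evidence the test can give for a given pair of sample sizes.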
Multiple Comparison Correction
When a large number of statistical tests is performed, there is a chance that in some of them the p-value
falls under a defined critical value purely by chance, leading to the wrong rejection of the null hypothesis
(type I error). In order to minimize the false positive rate, corrections to the p-value may be applied. [57]
The Bonferroni correction, Holm method and Benjamini-Hochberg procedure are three possible multiple
comparison methods and they are further explored in this thesis.
In the Bonferroni correction, instead of the usual critical significance level (commonly 0.05), a lower
critical value is used. It is estimated by dividing the usual critical significance level (α) by
the number of tests (m). A test is considered statistically relevant if the associated p-value (Pi) is lower
than the new critical significance level, as shown in Equation 2.20.
$$P_i < \frac{\alpha}{m} \tag{2.20}$$
This correction minimizes the family-wise error rate (FWER), which is the probability of making at
least one false conclusion, in other words, it is the probability of making a type I error at all.
This correction is mainly useful for a small number of multiple comparisons and when it is expected
that just a couple of attributes are meaningful. In big data sets there is the risk of estimating an unreasonably
small critical value, leading to a high rate of false negatives. [57] For this reason, this approach is
often considered very conservative.
Alternatively, the p-values obtained for each attribute may be adjusted and compared to the
usual critical significance level, instead of lowering the latter. This is obtained by adapting Equation
2.20: the p-value of each attribute is multiplied by the number of tests in order to obtain an adjusted
p-value. Each test is considered statistically relevant if the associated adjusted p-value is lower than the
standard critical significance level.
The Holm method is considered to be less conservative than the Bonferroni correction, while still
controlling the FWER. In this approach, the hypotheses are ordered from the smallest p-value
to the greatest and ranked, where the hypothesis with the smallest p-value has a rank of k = 1. The
adjusted significance level for each hypothesis is obtained by dividing the desired significance level (α) by the
number of hypotheses that may still be true, which corresponds to the total number of hypotheses (m) minus the
hypotheses already sequentially rejected (k − 1), that is, m + 1 − k. A hypothesis is rejected if the
unadjusted p-value (Pk) is lower than the adjusted significance level, as shown in Equation 2.21. [58]
$$P_k < \frac{\alpha}{m + 1 - k} \tag{2.21}$$
The variable k corresponds to the rank position of the hypothesis being tested. Regarding the first
ranked hypothesis, the adjusted significance level is given similarly to the Bonferroni correction and is
equal to α/m.
In the same fashion as the previous method (Bonferroni correction), instead of an altered significance
level, an adjusted p-value can be determined for each test by multiplying the original p-value by the
denominator in Equation 2.21 (m + 1 − k). The adjusted p-value is then compared to the usual
significance level. The test is considered statistically significant if the adjusted p-value is lower than the
desired significance level.
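The two FWER corrections can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R script used in this thesis: the function names `bonferroni` and `holm` are hypothetical, the adjusted p-values are capped at 1, and the Holm values are made non-decreasing along the ranking so that a hypothesis is never declared significant after a better-ranked one has failed.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: each p-value times m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm-adjusted p-values, returned in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from the smallest (rank k = 1) upwards.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, index in enumerate(order, start=1):
        candidate = min(1.0, (m + 1 - k) * pvalues[index])
        # Keep the adjusted values non-decreasing along the ranking.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Made-up raw p-values for four tests (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
bonferroni_adjusted = bonferroni(raw)  # approximately [0.04, 0.16, 0.12, 0.02]
holm_adjusted = holm(raw)              # approximately [0.03, 0.06, 0.06, 0.02]
```

In this made-up example, Holm never yields a larger adjusted p-value than Bonferroni, which is the sense in which it is the less conservative of the two.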
The Benjamini-Hochberg (BH) procedure controls the false discovery rate (FDR) by altering the
p-value determined by the test for each attribute. The false discovery rate reflects the expected proportion
of type I errors among the rejected null hypotheses. [59]
Using this approach, the attributes are initially arranged in increasing order of p-value and ranked, where the
smallest p-value has a rank of k = 1. Then, each individual p-value is compared to the corresponding
Benjamini-Hochberg critical value. These values are determined as shown in Equation 2.22.
$$\text{BH critical value} = \frac{k}{m} \cdot Q \tag{2.22}$$
where k is the rank of the variable, m is the number of tests and Q is the defined false discovery rate.
Each attribute will have a corresponding BH critical value calculated. The highest ranked attribute
(attribute with the highest value of k) that displays a p-value under its respective BH critical value is
considered the threshold at which the attributes should stop being considered statistically relevant. In
other words, all the attributes ranked with a lower value of k should be considered statistically relevant
even if their individual p-value is higher than their calculated BH critical value.
An adjusted BH p-value can also be calculated. This is either the raw p-value multiplied by m/k
or the adjusted p-value for the next higher raw p-value, whichever is smaller. When the adjusted
p-value is smaller than the established false discovery rate, the test is considered significant. [57]
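The adjusted BH p-value rule just described can be sketched as follows. This is a minimal illustration, not the R script used in this thesis; the function name `benjamini_hochberg` is hypothetical. The p-values are visited from the largest rank down to the smallest, so that "the adjusted p-value for the next higher raw p-value" is always available as a running minimum.

```python
def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from the smallest (rank k = 1) upwards.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest rank (k = m) down to the smallest (k = 1).
    for k in range(m, 0, -1):
        index = order[k - 1]
        # Either p * m / k or the adjusted value of the next higher p-value.
        running_min = min(running_min, pvalues[index] * m / k)
        adjusted[index] = running_min
    return adjusted


# Made-up raw p-values for four tests (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
bh_adjusted = benjamini_hochberg(raw)  # approximately [0.02, 0.04, 0.04, 0.02]
```

Comparing each adjusted value with Q gives the same set of significant attributes as the critical-value rule of Equation 2.22.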
In Bonferroni, Holm and BH corrections, all the individual tests are considered independent from
each other.
Unlike the univariate analysis, multivariate statistics allows the analysis of several independent
and/or dependent variables. This type of analysis takes into consideration the correlation among dependent
variables. [51][52] It will not be performed given the small sample size, as
mentioned previously.
2.6.2 Performance Metrics
The performance metrics reflect how good a classifier is and allow the comparison among classifiers.
One cannot say that a performance metric is better to evaluate a classifier when compared to all the
others. A combination of several measures may be used to investigate a classifier performance. [60]
Confusion Matrix
A confusion matrix enables the performance evaluation of a classifier on a given labeled data set. This
performance measure relies on the combination between predicted and actual values.
In a binary classifier (where the outcome is considered either positive or negative), a confusion matrix
can be generally represented as demonstrated by Table 2.1, where the rows correspond to the predicted
values, while the columns represent the actual values.
Table 2.1: General form of a confusion matrix.
                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative
The combination between the predicted and actual class value receives different designations. A true
positive (TP) occurs when an instance is correctly predicted as positive. A false positive (FP) occurs
when an instance is mistakenly predicted as positive. A false negative (FN) occurs when an instance is
mistakenly predicted as negative. A true negative (TN) occurs when an instance is correctly predicted
as negative.
From a confusion matrix several performance metrics can be calculated.
• Accuracy translates the frequency of correct classifications (Equation 2.23).
$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{2.23}$$
• Precision, also known as positive predictive value, represents the fraction of positives correctly
predicted over the total of instances predicted as positive (Equation 2.24).
$$\text{precision} = \frac{TP}{TP + FP} \tag{2.24}$$
• Sensitivity, also known as recall or true positive rate (TPR), translates the fraction of positives
correctly predicted over the total of instances with real positive value (Equation 2.25).
$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{2.25}$$
• Specificity represents the fraction of negatives correctly predicted over the total of instances with
real negative value (Equation 2.26).
$$\text{specificity} = \frac{TN}{TN + FP} \tag{2.26}$$
Complementary to the specificity performance metric there is the false positive rate (FPR), which
translates the frequency with which an observation is wrongly predicted as positive (Equation 2.27).
$$\text{FPR} = 1 - \text{specificity} = \frac{FP}{TN + FP} \tag{2.27}$$
• F1-measure summarizes the precision (p) and recall (r) performance measures, since it represents
a harmonic mean between these two (Equation 2.28). [49]
$$F = \frac{2 \cdot r \cdot p}{r + p} \tag{2.28}$$
The metrics described above will be later used during the univariate analysis.
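Equations 2.23 to 2.28 can be gathered in a short Python sketch. The function name `classification_metrics` and the counts in the example are hypothetical and serve only as an illustration; undefined ratios (a zero denominator) are returned as NaN rather than raising an error.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics of Equations 2.23 to 2.28 from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN when the metric is undefined (empty denominator).
        return numerator / denominator if denominator else float("nan")

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # recall, true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, tn + fp),  # 1 - specificity
        "f1": ratio(2 * sensitivity * precision, sensitivity + precision),
    }


# Made-up counts for 30 observations (illustration only).
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=16)
```

For these made-up counts the accuracy and the precision are both 0.8, while the sensitivity is only about 0.67, which shows why a single metric is not enough to characterize a classifier.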
ROC Curve
A ROC curve is a performance measurement for classification problems at several threshold settings.
The thresholds are defined according to the values of each variable.
The ROC curve is traced as TPR vs FPR. For a certain attribute, at each defined threshold, all the
observations above that threshold will be predicted as belonging to the positive class, while the remaining
observations will be considered as belonging to the negative class. Then the predicted class values
will be compared to the actual class values, leading to the calculation of the true positive rate and of the
false positive rate. Each threshold corresponds to a point (FPR, TPR) in the graph. When all the
thresholds for a certain attribute are evaluated, its ROC curve is traced. A good classifier should be
located as close as possible to the upper left corner of the diagram, while a random classifier will lie
along the diagonal. [49] [61]
The area under the curve (AUC) translates the separability of the two classes. The AUC presents
values ranging between 0 and 1. The higher the AUC, the better the classifier's capacity to distinguish
the two classes. When the AUC is equal to 0.5, the classifier has no discriminating power:
the decisions are made with the same certainty as a coin toss. [49] [61] An AUC between 0.5 and 0.7 is
considered poor, between 0.7 and 0.8 it is considered good, between 0.8 and 0.9 it is considered
very good and over 0.9 it is considered excellent.
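The construction of a ROC curve for a single attribute, and the corresponding AUC, can be sketched as below. This is a minimal illustration and not the pROC implementation: the function names `roc_points` and `auc_trapezoid` are hypothetical, each distinct attribute value is used as a threshold, observations with a value greater than or equal to the threshold are predicted as positive (an assumption about how ties with the threshold are handled), and the AUC is obtained with the trapezoidal rule.

```python
def roc_points(values, labels):
    """Return the ROC points (FPR, TPR) of one attribute.

    labels: 1 for the positive class, 0 for the negative class.
    An observation is predicted positive when its value >= threshold.
    """
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every value: nothing is positive
    for threshold in sorted(set(values), reverse=True):
        tp = sum(1 for v, l in zip(values, labels) if v >= threshold and l == 1)
        fp = sum(1 for v, l in zip(values, labels) if v >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up attribute values and class labels (illustration only).
values = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(values, labels)
area = auc_trapezoid(curve)  # 8/9: 8 of the 9 positive-negative pairs are ordered correctly
```

The AUC obtained this way equals the fraction of positive-negative pairs in which the positive observation has the higher value, which is why it is read as a measure of class separability.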
The threshold chosen to be applied by the classifier should have the best possible combination between
sensitivity and specificity. The compromise achieved between both is always problem dependent.
Usually a test with high sensitivity has low specificity. Sensitivity is favored over specificity if it
is more important to correctly identify a positive outcome than a negative one; in that case the test is subject to a
higher number of false positives. The opposite is also true: specificity is favored over sensitivity
if it is more important to correctly identify a negative outcome than a positive one; in that case the test is subject
to a higher number of missed positive cases.
The Problem of Class Imbalance
Accuracy is a performance metric commonly used to analyze the performance of a classifier. Nevertheless,
it may not be suited when the data set has imbalanced class distributions, which is a regular
occurrence in real applications. The accuracy of a classifier can be extremely high in the presence of
rare events. For instance, let us assume that only 1% of the vaccines produced by a certain company
are defective. If the classifier predicted that all vaccines are good, it would have an accuracy of 99%
without detecting a single defective vaccine, even though detecting this rare event is what matters. [49]
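The vaccine example can be reproduced numerically. The sketch below assumes a hypothetical batch of 1000 vaccines, 10 of which are defective (the positive class), and a classifier that labels every vaccine as good; the batch size is an assumption made only for illustration.

```python
# 1 = defective (positive class), 0 = good (negative class).
actual = [1] * 10 + [0] * 990   # 1% of the vaccines are defective
predicted = [0] * 1000          # the classifier calls every vaccine good

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

accuracy = correct / len(actual)          # 0.99
recall = true_positives / sum(actual)     # 0.0: no defective vaccine is found
```

The accuracy is 99% although the recall is zero, so the classifier is useless for the rare class despite its high accuracy.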
For binary classification other performance metrics can be used, such as precision, recall and the
ROC curve. Usually, the rare class is denoted as the positive class, while the majority class is denoted
as the negative class. [49]
Both precision and recall focus on the positive class. The first measures the true positives among
the instances predicted as positive, while the latter evaluates the predicted true positives among all the
actual positive instances. The F1-measure is often optimized when the positive class is considered more
interesting than the negative class, since it leads to an optimal compromise between precision
and recall. [49]
The ROC curve is a graphical approach that depicts the trade-off between the TPR and the FPR, as
previously explained. This metric is appropriate to compare the relative performance of different
classifiers, in particular when combined with the AUC, which allows one to infer which classifier is better on
average. [49]
3 Materials and Methods
3.1 Patient Population
The group of patients recruited for this study is composed of 30 people, 15 women and 15 men, with
an age range from 37 to 79 years, a mean age of 63 and a median age of 68 years. The classification
of the patients’ final response to treatment was made according to the IMWG response criteria: 1 patient
showed stringent complete response, 0 patients showed complete response, 11 displayed very good partial
response, 12 presented partial response, 4 exhibited stable disease and 2 were classified as having
progressive disease. [50]
3.2 MRI Techniques
Patients underwent MRI before chemotherapy and after they had completed one cycle of chemotherapy,
with the exception of 4 patients (ID: 27 to 30), for whom the magnetic resonance images were acquired
before and after all the chemotherapy cycles. All the MRI examinations were conducted in the University
Hospital of Athens, Greece. In order to obtain the magnetic resonance images, the following pulse
sequences were performed: T1-weighted sagittal lumbar spine (repetition time: 400 msec, echo time:
7.4 msec, section thickness: 4.0 mm, section gap: 0.8 mm, image matrix: 246 x 512, number of signals
acquired: 4, field of view: 300 x 300 mm, acquisition time: 138 sec), short TI inversion recovery sagittal
lumbar spine (repetition time: 2500 msec, echo time: 60 msec, section thickness: 4.0 mm, section gap:
0.8 mm, image matrix: 214 x 512, number of signals acquired: 4, field of view: 300 x 300 mm, acquisition
time for each sequence: 180 sec), dual gradient-echo in- and opposed-phase lumbar spine (repetition
time: 300 msec, echo time for the opposed and in phase, respectively: 2.3/4.6 msec, section thickness:
4.0 mm, section gap: 0.8 mm, flip angle: 25°, image matrix: 122 x 512, number of signals acquired:
4, field of view: 300 x 300 mm, acquisition time for each sequence: 116 sec) and DWI steady-state
echo-planar lumbar spine (repetition time: 2000 msec, echo time: 75 msec, section thickness: 5.0 mm,
section gap: 1.0 mm, image matrix: 152 x 256, number of signals acquired: 8, field of view: 300 x 300
mm, acquisition time: 308 sec). The b-values used for the DWI pulse sequence were 0, 150, 25, 500
and 750 sec/mm². [62]
All examinations were performed on a 1.5-T unit (Philips Healthcare, Best, the Netherlands) with a surface phased-array coil.
3.3 Statistical Analysis Implementation
3.3.1 Data sets
The MRI images were acquired before and after the first round of chemotherapy. For this reason, one
may consider three distinct situations for data analysis: pre-treatment (composed of the imaging
features related to the MRI images acquired before treatment), post-1stcycle (composed of
the imaging features related to the MRI images acquired during treatment, after the first round of
chemotherapy) and delta (the difference between the post-1stcycle and pre-treatment
moments, which represents the evolution of the signal’s intensity over the first round of chemotherapy). For
each parametric map (ADC and FF), three data sets were organized, each one considering a different
situation (pre-treatment, post-1stcycle or delta).
A seventh data set is also considered, which accounts for an additional exploitation of the available
data. It comprises the clinical data from all the patients involved in this study. Some of the features
evaluated were: age, height, gender, weight and protein levels, all of which are explained in Section
2.5.
As mentioned previously in the section 2.6, three situations were considered in terms of response
classification (scenario 1, 2 and 3). The labelled data sets will have different class proportions.
When the partial response to the treatment is considered an effective response to treatment (scenario
1), the resulting data set recognizes approximately 20% of the patients as non-responsive and 80% as
responsive to treatment. On the other hand, if the partial response to treatment is not considered an
effective response to treatment (scenario 2), the resulting data set recognizes approximately 60% of the
patients as non-responsive and 40% as responsive. In scenario 1, one obtains a more unbalanced data
set when compared with scenario 2. When the patients with stable or progressive disease are excluded
from the study (scenario 3), approximately half of the patients are considered as responsive, while the
other half is considered non-responsive to treatment.
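The class proportions quoted above follow directly from the response counts reported in Section 3.1. The sketch below recomputes them for the full group of 30 patients; the names `COUNTS`, `SCENARIOS` and `class_proportions` are hypothetical, and the proportions of an individual data set may deviate slightly once patients with missing values are removed.

```python
# Number of patients per IMWG response class (1 to 6), as reported in Section 3.1.
COUNTS = {1: 1, 2: 0, 3: 11, 4: 12, 5: 4, 6: 2}

# (responder classes, non-responder classes) for each scenario of Section 2.6.
SCENARIOS = {
    1: ({1, 2, 3, 4}, {5, 6}),
    2: ({1, 2, 3}, {4, 5, 6}),
    3: ({1, 2, 3}, {4}),  # stable and progressive disease are excluded
}


def class_proportions(counts, responders, non_responders):
    """Return (fraction of responders, fraction of non-responders)."""
    n_responders = sum(counts[c] for c in responders)
    n_non_responders = sum(counts[c] for c in non_responders)
    total = n_responders + n_non_responders
    return n_responders / total, n_non_responders / total


proportions = {
    scenario: class_proportions(COUNTS, *classes)
    for scenario, classes in SCENARIOS.items()
}
# scenario 1: (0.8, 0.2); scenario 2: (0.4, 0.6); scenario 3: (0.5, 0.5)
```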
3.3.2 Data Preparation
Each data set was individually inspected and the patients with missing imaging feature values were
removed from the data set under examination. Furthermore, patients 27 to 30 underwent the second MRI
scan not after the first round of chemotherapy but at the end of the complete chemotherapy treatment;
for this reason, these patients are not present in the data sets for the post-1stcycle and delta situations.
Taking the above into consideration, the data sets do not necessarily contain an
equal number of patients, ranging from the original 30 patients down to 23. In the particular case of
scenario 3, some data sets only consider 18 patients, since the ones with stable or progressive disease
are removed.
3.3.3 Univariate Analysis
p-value Evaluation
The Mann-Whitney U-Test is applied using the operator Mann-Whitney U-Test, available on Rapid-
Miner, that computes the p-value associated with the null hypothesis - whether the two samples (re-
sponders and non-responders) come from the same population by comparing their medians.
Different multiple comparison corrections were applied: Bonferroni, Holm and Benjamini-Hochberg.
These methods are successively less conservative and it is expected that in the less conservative cor-
rections more attributes are indicated as statistically significant.
It was defined that if the adjusted p-value corresponding to a certain attribute is lower than the
standard significance level of 0.05, one can be confident in the rejection of the null hypothesis for the
Mann-Whitney U-Test, which states that the two populations have the same median. In other words, by
rejecting the null hypothesis, the two populations are considered well separated by the attribute tested.
In order to obtain the adjusted p-values for each of the multiple comparison corrections considered,
a small script was written in R and applied in the process developed with the software RapidMiner.
The script takes as input the data generated by the Mann-Whitney U-Test and
applies the R function p.adjust, which generates the adjusted p-value for each of the attributes
under the multiple comparison corrections previously described. The R script for
the calculation of the adjusted p-values is available in the Appendix Subsection C.1.1. The number of
comparisons is automatically set to the number of attributes present in each data set. Since the imaging
feature p-values from the ADC and FF maps are corrected for multiple comparisons together in each
condition (pre-treatment, post-1stcycle and delta) and in each possible scenario (1, 2 and 3), the number
of comparisons is equal to 36.
The general process developed for p-value collection in the univariate analysis is depicted in Figure
3.1.
ROC Curve Evaluation
The AUC performance measure was also used to evaluate how well each attribute performs on the
binary classification.
The code used to obtain the AUC is displayed in the Appendix Subsection C.1.2. It was written
in R, using the package pROC, to be applied in a recursive fashion.
When tracing the ROC curve, the function roc available in the package pROC allows the definition
of the direction in which the thresholds are evaluated: in ascending or descending order of value. The
function automatically compares the medians from both groups. If the median of the positive class is
higher than the median from the negative class, the thresholds are evaluated in descending order (from
the higher to the lower threshold value). On the other hand, if the median of the negative class is higher
than the median from the positive class, the thresholds are evaluated in ascending order (from the lower
to the higher threshold value).
Figure 3.1: General process used for the collection of the p-values in the univariate analysis.
In the construction of the ROC curves, the positive class is the minority class. In scenario 1 and in the
FF map data sets from scenario 3, the positive class will be the non-responders. In this case, and
considering the problem at hand, one wants to minimize the FPR, even if it means having a worse TPR.
The motive that underlies this decision is the priority given to continuing the treatment with people that
are responding to it. By minimizing the FPR, the specificity is maximized, which directly translates
into the maximization of the number of patients correctly predicted as true negatives, in other words,
true responders to treatment. If the responders are the positive class (scenario 2 and ADC map data
sets from scenario 3), then one wants the opposite situation, which means maximizing the TPR,
even at the cost of a worse FPR.
The ROC curve of each attribute with an AUC over 0.70 was further analyzed. For each data set,
the attributes with the best sensitivity and specificity were selected. In scenario 1 and in the
FF map data sets from scenario 3, only TPR values over 0.6 and FPR values under 0.2 were
considered, while in scenario 2 and in the ADC map data sets from scenario 3, only TPR values
over 0.8 and FPR values under 0.4 were considered. Afterwards, the attributes
from this filtered subset were compared among themselves with the intent of finding which attributes
would be the best possible biomarkers.
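The filtering step described above can be sketched as follows. This is only an illustration of the selection rule, not the procedure implemented in R and RapidMiner: the function name `select_candidates`, the attribute names and all numeric values in the example are hypothetical.

```python
def select_candidates(results, min_tpr, max_fpr, min_auc=0.70):
    """Keep attributes with AUC > min_auc and at least one acceptable ROC point.

    results maps an attribute name to a dict with its "auc" and its ROC
    "points", given as (FPR, TPR) pairs. A point is acceptable when
    TPR > min_tpr and FPR < max_fpr.
    """
    selected = {}
    for attribute, info in results.items():
        if info["auc"] <= min_auc:
            continue
        acceptable = [
            (fpr, tpr)
            for fpr, tpr in info["points"]
            if tpr > min_tpr and fpr < max_fpr
        ]
        if acceptable:
            selected[attribute] = acceptable
    return selected


# Made-up ROC summaries for three hypothetical attributes (illustration only).
example = {
    "attribute_a": {"auc": 0.85, "points": [(0.0, 0.4), (0.1, 0.7), (0.3, 0.9)]},
    "attribute_b": {"auc": 0.65, "points": [(0.1, 0.7)]},
    "attribute_c": {"auc": 0.75, "points": [(0.25, 0.65), (0.5, 0.9)]},
}

# Cut-offs used when the non-responders are the positive class.
strict_fpr = select_candidates(example, min_tpr=0.6, max_fpr=0.2)
# Cut-offs used when the responders are the positive class.
strict_tpr = select_candidates(example, min_tpr=0.8, max_fpr=0.4)
```

In this made-up example only attribute_a passes either filter, with a different operating point retained in each case.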
All attributes are considered to be independent in a univariate analysis and they are analyzed indi-
vidually when evaluating the ROC curve.
4 Results
4.1 Univariate Analysis
Using univariate analysis it is possible to develop an idea of how the attributes would behave in an
independent manner, revealing which ones would be more interesting to further explore. These consid-
erations could be then transported to the multivariate analysis.
4.1.1 p-value and ROC Curve Evaluation for the First Order Imaging Features
A written analysis was conducted of the results depicted in Tables 4.1 to 4.9, which display
all the original and adjusted p-values and the AUC (retrieved from the ROC curve) for each test of each
study. As mentioned previously, a significance level of 0.05 was chosen. All the p-values that fall under
this threshold are colored blue. The AUC values are also presented in these tables and the ones equal to
or over 0.70 are highlighted in orange. This threshold was chosen to guarantee that the attributes further
evaluated have an AUC value that is considered at worst good.
In the pre-treatment data set in scenario 1 there are 4 attributes that show statistical significance
regarding the ADC map (interquartile range, kurtosis, mean absolute deviation and minimum) and there
are 14 attributes that show statistical significance concerning the FF map (90 percentile, energy, entropy,
interquartile range, maximum, mean, mean absolute deviation, median, range, robust mean absolute de-
viation, root mean squared, skewness, total energy and uniformity) after the Mann-Whitney U-Test. Al-
though none of the attributes survive the more conservative multiple comparison corrections (Bonferroni
correction and Holm’s method), there are 10 attributes that survive the Benjamini-Hochberg procedure:
the attribute kurtosis regarding the ADC map and the attributes 90 percentile, energy, entropy, mean,
median, RMS, skewness, total energy and uniformity concerning the FF map. A considerable amount of
attributes displays an AUC over 0.70, 8 attributes regarding the ADC map (entropy, interquartile range,
kurtosis, mean absolute deviation, minimum, robust mean absolute deviation, uniformity and variance)
and 17 out of the 18 attributes concerning the FF map, all with the exception of the minimum. All the
attributes that survive the Benjamini-Hochberg procedure, present an AUC over 0.840, which is already
considered a very good value. These results are presented in Table 4.1.
In the pre-treatment data set in scenario 2 only the attribute kurtosis concerning the FF map
shows statistical significance after the Mann-Whitney U-Test and it does not survive any of the multiple
comparison corrections. Just 3 attributes regarding the FF map (interquartile range, kurtosis and robust
mean absolute deviation) display an AUC value over 0.70. These results are presented in Table 4.2.
In the pre-treatment data set in scenario 3 there are 6 attributes that show statistical significance
regarding the FF map (10 percentile, 90 percentile, kurtosis, median, root mean squared and skewness)
after the Mann-Whitney U-Test. Nevertheless, none of the attributes’ p-value remains under the threshold
of 0.05 when submitted to the multiple comparison corrections. The attributes that show statistical
significance after the Mann-Whitney U-Test are the only ones to display an AUC value over 0.70. These
results are presented in Table 4.3.
In the post-1stcycle data set in scenario 1 there are 5 attributes that show statistical significance
regarding the ADC map (interquartile range, mean absolute deviation, minimum, robust mean absolute
deviation and variance) and there is 1 attribute that shows statistical significance concerning the FF map
(skewness) after the Mann-Whitney U-Test. Nevertheless, none of the attributes’ p-value remains under
the threshold of 0.05 when submitted to the multiple comparison corrections. A considerable amount
of attributes displays an AUC over 0.70, 10 attributes regarding the ADC map (10 percentile, entropy,
interquartile range, kurtosis, mean absolute deviation, minimum, range, robust mean absolute deviation,
uniformity and variance) and 10 attributes concerning the FF map (10 percentile, 90 percentile, energy,
entropy, kurtosis, mean, median, root mean squared, skewness, total energy and uniformity). These
results are presented in Table 4.4.
In the post-1stcycle data set in scenario 2 only the attribute kurtosis concerning the FF map shows
statistical significance after the Mann-Whitney U-Test and it does not survive any of the multiple
comparison corrections. This attribute is also the only one to display an AUC value over 0.70. These results are
presented in Table 4.5.
In the post-1stcycle data set in scenario 3 there are 8 attributes that show statistical significance
regarding the FF map (10 percentile, 90 percentile, energy, mean absolute deviation, median, root
mean squared, skewness and total energy) after the Mann-Whitney U-Test. Nevertheless, none of
the attributes’ p-value remains under the threshold of 0.05 when submitted to the multiple comparison
corrections. An AUC over 0.70 is displayed by 10 attributes in the FF map (10 percentile, 90 percentile,
energy, kurtosis, maximum, mean absolute deviation, median, root mean squared, skewness and total
energy). These results are presented in Table 4.6.
In the delta data set in scenario 1 no attribute shows statistical significance after the
Mann-Whitney U-Test. Only 4 attributes display an AUC value over 0.70: 1 regarding
the ADC map (variance) and 3 concerning the FF map (energy, mean absolute deviation and
robust mean absolute deviation). These results are presented in Table 4.7.
In the delta data set in scenario 2 there are 5 attributes that display statistical significance regarding
the ADC map (interquartile range, mean absolute deviation, range, robust mean absolute deviation and
variance) and 5 attributes are considered statistically significant concerning the FF map (10 percentile,
90 percentile, mean, median and root mean squared) after the Mann-Whitney U-Test. Nevertheless,
none of the attributes’ p-value remains under the threshold of 0.05 when submitted to the multiple com-
parison corrections. Some attributes display an AUC over 0.70, 8 attributes regarding the ADC map (90
percentile, entropy, interquartile range, maximum, mean absolute deviation, range, robust mean abso-
lute deviation and variance) and 6 concerning the FF map (10 percentile, 90 percentile, mean, median,
root mean squared and skewness). These results are presented in Table 4.8.
In the delta data set in scenario 3 there are 5 attributes that display statistical significance regarding
the ADC map (90 percentile, interquartile range, mean, robust mean absolute deviation and variance)
and 6 attributes are considered statistically significant concerning the FF map (10 percentile, 90 per-
centile, mean absolute deviation, median, root mean squared and skewness) after the Mann-Whitney
U-Test. Nevertheless, none of the attributes’ p-value remains under the threshold of 0.05 when submitted
to the multiple comparison corrections. Some attributes display an AUC over 0.70, 9 attributes regarding
the ADC map (90 percentile, entropy, interquartile range, maximum, mean, range, robust mean absolute
deviation, root mean squared and variance) and 6 concerning the FF map (10 percentile, 90 percentile,
mean absolute deviation, median, root mean squared and skewness). These results are presented in
Table 4.9.
Tables 4.1 to 4.9 illustrate the importance of the multiple comparison corrections. There is always a
significant drop in the number of attributes with statistical significance
when the multiple comparison corrections are applied, as expected. From the 216 attributes evaluated,
35 display statistical significance after the Mann-Whitney U-Test and only 10 survive the Benjamini-
Hochberg procedure. None of the attributes evaluated survives the more conservative approaches,
the Bonferroni and Holm corrections.
It is interesting to observe that all the attributes that survive the Benjamini-Hochberg procedure are
first-order statistics which belong to the pre-treatment data sets in scenario 1, and the vast majority was
extracted from the FF map.
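The three corrections compared above can be sketched in a few lines of Python. This is only an illustrative sketch of the standard textbook definitions of the adjusted p-values; the results in this thesis were obtained with RapidMiner® and RStudio®, and the function names and example p-values below are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):          # rank 0 is the smallest p-value
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)   # keep monotone and cap at 1
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up: the k-th smallest p-value is multiplied by m / k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):         # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, not taken from the tables of this chapter.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

Because Bonferroni and Holm multiply the smallest p-values by (close to) the full number of tests, they are stricter than the Benjamini-Hochberg procedure, which is consistent with the pattern seen in Tables 4.1 to 4.9.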
Table 4.1: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.216      1.000       1.000  0.277  0.667 (0.449, 0.884)
ADC  90 Percentile        0.554      1.000       1.000  0.664  0.580 (0.276, 0.883)
ADC  Energy               0.788      1.000       1.000  0.810  0.464 (0.152, 0.775)
ADC  Entropy              0.053      1.000       0.947  0.084  0.761 (0.463, 1.000)
ADC  Interquartile Range  0.046      1.000       0.927  0.084  0.768 (0.495, 1.000)
ADC  Kurtosis             0.008      0.300       0.283  0.042  0.855 (0.679, 1.000)
ADC  Maximum              0.146      1.000       1.000  0.195  0.696 (0.407, 0.984)
ADC  Mean                 0.788      1.000       1.000  0.810  0.536 (0.252, 0.820)
ADC  MAD                  0.046      1.000       0.927  0.084  0.768 (0.482, 1.000)
ADC  Median               0.788      1.000       1.000  0.810  0.536 (0.252, 0.820)
ADC  Minimum              0.036      1.000       0.858  0.084  0.783 (0.555, 1.000)
ADC  Range                0.146      1.000       1.000  0.195  0.696 (0.402, 0.989)
ADC  Robust MAD           0.053      1.000       0.947  0.084  0.761 (0.475, 1.000)
ADC  RMS                  0.914      1.000       1.000  0.914  0.486 (0.192, 0.779)
ADC  Skewness             0.667      1.000       1.000  0.774  0.558 (0.321, 0.795)
ADC  Total Energy         0.747      1.000       1.000  0.810  0.457 (0.151, 0.763)
ADC  Uniformity           0.053      1.000       0.947  0.084  0.761 (0.488, 1.000)
ADC  Variance             0.053      1.000       0.947  0.084  0.761 (0.463, 1.000)
FF   10 Percentile        0.053      1.000       0.947  0.084  0.761 (0.574, 0.949)
FF   90 Percentile        0.005      0.184       0.184  0.042  0.879 (0.747, 1.000)
FF   Energy               0.010      0.360       0.300  0.042  0.848 (0.665, 1.000)
FF   Entropy              0.010      0.423       0.341  0.042  0.841 (0.676, 1.000)
FF   Interquartile Range  0.020      0.725       0.504  0.060  0.814 (0.648, 0.981)
FF   Kurtosis             0.083      1.000       0.991  0.119  0.735 (0.486, 0.984)
FF   Maximum              0.017      0.624       0.451  0.057  0.822 (0.621, 1.000)
FF   Mean                 0.012      0.423       0.341  0.042  0.841 (0.674, 1.000)
FF   MAD                  0.044      1.000       0.921  0.084  0.773 (0.513, 1.000)
FF   Median               0.009      0.306       0.283  0.042  0.856 (0.698, 1.000)
FF   Minimum              0.341      1.000       1.000  0.424  0.371 (0.186, 0.557)
FF   Range                0.036      1.000       0.858  0.084  0.784 (0.579, 0.989)
FF   Robust MAD           0.038      1.000       0.858  0.084  0.780 (0.566, 0.995)
FF   RMS                  0.009      0.306       0.283  0.042  0.856 (0.704, 1.000)
FF   Skewness             0.009      0.306       0.283  0.042  0.856 (0.702, 1.000)
FF   Total Energy         0.007      0.259       0.252  0.042  0.864 (0.703, 1.000)
FF   Uniformity           0.012      0.423       0.341  0.042  0.841 (0.666, 1.000)
FF   Variance             0.065      1.000       0.947  0.097  0.750 (0.480, 1.000)
Table 4.2: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 2. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.500      1.000       1.000  0.816  0.576 (0.351, 0.801)
ADC  90 Percentile        0.590      1.000       1.000  0.816  0.561 (0.318, 0.803)
ADC  Energy               0.445      1.000       1.000  0.816  0.586 (0.347, 0.825)
ADC  Entropy              0.753      1.000       1.000  0.822  0.465 (0.232, 0.697)
ADC  Interquartile Range  0.669      1.000       1.000  0.822  0.548 (0.313, 0.783)
ADC  Kurtosis             0.472      1.000       1.000  0.816  0.581 (0.348, 0.814)
ADC  Maximum              1.000      1.000       1.000  1.000  0.500 (0.258, 0.742)
ADC  Mean                 0.500      1.000       1.000  0.816  0.576 (0.333, 0.818)
ADC  MAD                  0.753      1.000       1.000  0.822  0.535 (0.302, 0.769)
ADC  Median               0.529      1.000       1.000  0.816  0.571 (0.328, 0.813)
ADC  Minimum              0.515      1.000       1.000  0.816  0.427 (0.274, 0.579)
ADC  Range                0.857      1.000       1.000  0.882  0.520 (0.279, 0.761)
ADC  Robust MAD           0.686      1.000       1.000  0.822  0.545 (0.312, 0.779)
ADC  RMS                  0.500      1.000       1.000  0.816  0.576 (0.339, 0.813)
ADC  Skewness             0.150      1.000       1.000  0.816  0.662 (0.450, 0.873)
ADC  Total Energy         0.559      1.000       1.000  0.816  0.566 (0.329, 0.803)
ADC  Uniformity           0.719      1.000       1.000  0.822  0.460 (0.220, 0.700)
ADC  Variance             0.719      1.000       1.000  0.822  0.460 (0.230, 0.689)
FF   10 Percentile        0.236      1.000       1.000  0.816  0.633 (0.432, 0.834)
FF   90 Percentile        0.341      1.000       1.000  0.816  0.607 (0.390, 0.824)
FF   Energy               0.642      1.000       1.000  0.822  0.552 (0.3289, 0.775)
FF   Entropy              0.150      1.000       1.000  0.816  0.661 (0.447, 0.876)
FF   Interquartile Range  0.060      1.000       1.000  0.760  0.711 (0.509, 0.913)
FF   Kurtosis             0.023      0.825       0.825  0.760  0.755 (0.556, 0.954)
FF   Maximum              0.835      1.000       1.000  0.882  0.523 (0.332, 0.715)
FF   Mean                 0.403      1.000       1.000  0.816  0.594 (0.375, 0.813)
FF   MAD                  0.114      1.000       1.000  0.816  0.677 (0.463, 0.892)
FF   Median               0.390      1.000       1.000  0.816  0.596 (0.373, 0.820)
FF   Minimum              0.296      1.000       1.000  0.816  0.383 (0.204, 0.562)
FF   Range                0.227      1.000       1.000  0.816  0.635 (0.416, 0.854)
FF   Robust MAD           0.063      1.000       1.000  0.760  0.708 (0.505, 0.912)
FF   RMS                  0.330      1.000       1.000  0.816  0.609 (0.391, 0.827)
FF   Skewness             0.403      1.000       1.000  0.816  0.594 (0.372, 0.816)
FF   Total Energy         0.577      1.000       1.000  0.816  0.563 (0.340, 0.785)
FF   Uniformity           0.286      1.000       1.000  0.816  0.620 (0.400, 0.839)
FF   Variance             0.178      1.000       1.000  0.816  0.651 (0.431, 0.871)
Table 4.3: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the pre-treatment data set in scenario 3. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.218      1.000       1.000  0.561  0.652 (0.411, 0.892)
ADC  90 Percentile        0.758      1.000       1.000  0.758  0.538 (0.273, 0.803)
ADC  Energy               0.498      1.000       1.000  0.750  0.583 (0.323, 0.844)
ADC  Entropy              0.622      1.000       1.000  0.750  0.561 (0.295, 0.827)
ADC  Interquartile Range  0.644      1.000       1.000  0.750  0.557 (0.295, 0.819)
ADC  Kurtosis             0.758      1.000       1.000  0.758  0.538 (0.270, 0.805)
ADC  Maximum              0.538      1.000       1.000  0.750  0.576 (0.311, 0.841)
ADC  Mean                 0.538      1.000       1.000  0.750  0.576 (0.313, 0.839)
ADC  MAD                  0.389      1.000       1.000  0.750  0.606 (0.353, 0.859)
ADC  Median               0.424      1.000       1.000  0.750  0.598 (0.346, 0.852)
ADC  Minimum              0.735      1.000       1.000  0.758  0.542 (0.396, 0.688)
ADC  Range                0.667      1.000       1.000  0.750  0.553 (0.286, 0.820)
ADC  Robust MAD           0.667      1.000       1.000  0.750  0.553 (0.290, 0.816)
ADC  RMS                  0.460      1.000       1.000  0.750  0.591 (0.337, 0.845)
ADC  Skewness             0.218      1.000       1.000  0.561  0.652 (0.410, 0.893)
ADC  Total Energy         0.667      1.000       1.000  0.750  0.553 (0.296, 0.810)
ADC  Uniformity           0.622      1.000       1.000  0.750  0.561 (0.289, 0.832)
ADC  Variance             0.622      1.000       1.000  0.750  0.561 (0.298, 0.823)
FF   10 Percentile        0.039      1.000       1.000  0.251  0.754 (0.549, 0.959)
FF   90 Percentile        0.025      0.888       0.888  0.251  0.777 (0.555, 0.998)
FF   Energy               0.124      1.000       1.000  0.496  0.689 (0.449, 0.930)
FF   Entropy              0.424      1.000       1.000  0.750  0.598 (0.351, 0.846)
FF   Interquartile Range  0.186      1.000       1.000  0.557  0.663 (0.435, 0.891)
FF   Kurtosis             0.036      1.000       1.000  0.251  0.758 (0.548, 0.967)
FF   Maximum              0.601      1.000       1.000  0.750  0.564 (0.375, 0.754)
FF   Mean                 0.242      1.000       1.000  0.581  0.644 (0.408, 0.880)
FF   MAD                  0.049      1.000       1.000  0.251  0.742 (0.510, 0.975)
FF   Median               0.034      1.000       1.000  0.251  0.761 (0.523, 0.999)
FF   Minimum              0.148      1.000       1.000  0.514  0.678 (0.474, 0.882)
FF   Range                0.498      1.000       1.000  0.750  0.583 (0.343, 0.823)
FF   Robust MAD           0.157      1.000       1.000  0.514  0.674 (0.447, 0.901)
FF   RMS                  0.031      1.000       1.000  0.251  0.765 (0.541, 0.989)
FF   Skewness             0.042      1.000       1.000  0.251  0.750 (0.511, 0.989)
FF   Total Energy         0.085      1.000       1.000  0.382  0.712 (0.478, 0.946)
FF   Uniformity           0.758      1.000       1.000  0.758  0.538 (0.283, 0.793)
FF   Variance             0.295      1.000       1.000  0.665  0.629 (0.390, 0.867)
Table 4.4: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.074      1.000       1.000  0.222  0.762 (0.577, 0.947)
ADC  90 Percentile        0.495      1.000       1.000  0.659  0.600 (0.307, 0.893)
ADC  Energy               0.672      1.000       1.000  0.835  0.562 (0.285, 0.839)
ADC  Entropy              0.085      1.000       1.000  0.222  0.752 (0.452, 1.000)
ADC  Interquartile Range  0.034      1.000       1.000  0.222  0.810 (0.537, 1.000)
ADC  Kurtosis             0.111      1.000       1.000  0.222  0.733 (0.516, 0.950)
ADC  Maximum              0.182      1.000       1.000  0.298  0.695 (0.436, 0.955)
ADC  Mean                 0.871      1.000       1.000  0.896  0.524 (0.204, 0.844)
ADC  MAD                  0.040      1.000       1.000  0.222  0.800 (0.528, 1.000)
ADC  Median               0.871      1.000       1.000  0.896  0.524 (0.217, 0.831)
ADC  Minimum              0.040      1.000       1.000  0.222  0.800 (0.560, 1.000)
ADC  Range                0.143      1.000       1.000  0.249  0.714 (0.457, 0.972)
ADC  Robust MAD           0.034      1.000       1.000  0.222  0.810 (0.537, 1.000)
ADC  RMS                  0.720      1.000       1.000  0.837  0.552 (0.256, 0.849)
ADC  Skewness             0.820      1.000       1.000  0.894  0.533 (0.170, 0.896)
ADC  Total Energy         0.770      1.000       1.000  0.866  0.543 (0.250, 0.835)
ADC  Uniformity           0.064      1.000       1.000  0.222  0.771 (0.487, 1.000)
ADC  Variance             0.034      1.000       1.000  0.222  0.810 (0.554, 1.000)
FF   10 Percentile        0.110      1.000       1.000  0.222  0.737 (0.340, 1.000)
FF   90 Percentile        0.095      1.000       1.000  0.222  0.747 (0.367, 1.000)
FF   Energy               0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
FF   Entropy              0.126      1.000       1.000  0.240  0.726 (0.349, 1.000)
FF   Interquartile Range  0.546      1.000       1.000  0.702  0.589 (0.179, 1.000)
FF   Kurtosis             0.145      1.000       1.000  0.249  0.716 (0.373, 1.000)
FF   Maximum              0.227      1.000       1.000  0.355  0.679 (0.274, 1.000)
FF   Mean                 0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
FF   MAD                  0.455      1.000       1.000  0.631  0.611 (0.195, 1.000)
FF   Median               0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
FF   Minimum              0.972      1.000       1.000  0.972  0.505 (0.234, 0.755)
FF   Range                0.374      1.000       1.000  0.561  0.632 (0.219, 1.000)
FF   Robust MAD           0.696      1.000       1.000  0.835  0.558 (0.123, 0.993)
FF   RMS                  0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
FF   Skewness             0.017      0.621       0.621  0.222  0.853 (0.654, 1.000)
FF   Total Energy         0.110      1.000       1.000  0.222  0.737 (0.354, 1.000)
FF   Uniformity           0.110      1.000       1.000  0.222  0.737 (0.363, 1.000)
FF   Variance             0.455      1.000       1.000  0.631  0.611 (0.178, 1.000)
Table 4.5: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 2. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        1.000      1.000       1.000  1.000  0.500 (0.239, 0.761)
ADC  90 Percentile        0.120      1.000       1.000  0.420  0.684 (0.439, 0.930)
ADC  Energy               0.493      1.000       1.000  0.550  0.581 (0.323, 0.840)
ADC  Entropy              0.188      1.000       1.000  0.420  0.656 (0.414, 0.898)
ADC  Interquartile Range  0.147      1.000       1.000  0.420  0.672 (0.428, 0.916)
ADC  Kurtosis             0.292      1.000       1.000  0.420  0.625 (0.396, 0.854)
ADC  Maximum              0.155      1.000       1.000  0.420  0.669 (0.416, 0.921)
ADC  Mean                 0.268      1.000       1.000  0.420  0.631 (0.384, 0.879)
ADC  MAD                  0.188      1.000       1.000  0.420  0.656 (0.409, 0.904)
ADC  Median               0.292      1.000       1.000  0.420  0.625 (0.371, 0.879)
ADC  Minimum              0.429      1.000       1.000  0.550  0.406 (0.308, 0.505)
ADC  Range                0.155      1.000       1.000  0.420  0.669 (0.416, 0.921)
ADC  Robust MAD           0.155      1.000       1.000  0.420  0.669 (0.423, 0.914)
ADC  RMS                  0.155      1.000       1.000  0.420  0.669 (0.420, 0.918)
ADC  Skewness             0.493      1.000       1.000  0.550  0.581 (0.344, 0.818)
ADC  Total Energy         0.493      1.000       1.000  0.550  0.581 (0.316, 0.847)
ADC  Uniformity           0.206      1.000       1.000  0.420  0.650 (0.404, 0.896)
ADC  Variance             0.188      1.000       1.000  0.420  0.656 (0.412, 0.901)
FF   10 Percentile        0.279      1.000       1.000  0.420  0.632 (0.407, 0.857)
FF   90 Percentile        0.279      1.000       1.000  0.420  0.632 (0.395, 0.869)
FF   Energy               0.128      1.000       1.000  0.420  0.686 (0.458, 0.913)
FF   Entropy              0.266      1.000       1.000  0.420  0.636 (0.404, 0.867)
FF   Interquartile Range  0.208      1.000       1.000  0.420  0.654 (0.427, 0.881)
FF   Kurtosis             0.040      1.000       1.000  0.420  0.750 (0.547, 0.953)
FF   Maximum              0.558      1.000       1.000  0.574  0.429 (0.226, 0.631)
FF   Mean                 0.198      1.000       1.000  0.420  0.657 (0.425, 0.889)
FF   MAD                  0.380      1.000       1.000  0.506  0.607 (0.366, 0.849)
FF   Median               0.169      1.000       1.000  0.420  0.668 (0.437, 0.899)
FF   Minimum              0.230      1.000       1.000  0.420  0.354 (0.197, 0.510)
FF   Range                0.520      1.000       1.000  0.550  0.579 (0.352, 0.805)
FF   Robust MAD           0.219      1.000       1.000  0.420  0.650 (0.423, 0.877)
FF   RMS                  0.198      1.000       1.000  0.420  0.657 (0.425, 0.889)
FF   Skewness             0.380      1.000       1.000  0.506  0.607 (0.361, 0.853)
FF   Total Energy         0.178      1.000       1.000  0.420  0.664 (0.433, 0.895)
FF   Uniformity           0.520      1.000       1.000  0.550  0.579 (0.340, 0.817)
FF   Variance             0.447      1.000       1.000  0.550  0.593 (0.351, 0.835)
Table 4.6: p-values achieved with the Mann-Whitney U-Test (MW U-Test) before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the post-1st cycle data set in scenario 3. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.573      1.000       1.000  0.688  0.573 (0.301, 0.844)
ADC  90 Percentile        0.149      1.000       1.000  0.440  0.686 (0.437, 0.935)
ADC  Energy               0.573      1.000       1.000  0.688  0.573 (0.306, 0.840)
ADC  Entropy              0.439      1.000       1.000  0.688  0.600 (0.325, 0.875)
ADC  Interquartile Range  0.460      1.000       1.000  0.688  0.595 (0.311, 0.880)
ADC  Kurtosis             0.673      1.000       1.000  0.734  0.445 (0.186, 0.705)
ADC  Maximum              0.291      1.000       1.000  0.551  0.636 (0.365, 0.907)
ADC  Mean                 0.526      1.000       1.000  0.688  0.582 (0.297, 0.867)
ADC  MAD                  0.205      1.000       1.000  0.492  0.664 (0.406, 0.921)
ADC  Median               0.260      1.000       1.000  0.550  0.645 (0.379, 0.912)
ADC  Minimum              1.000      1.000       1.000  1.000  0.500 (0.500, 0.500)
ADC  Range                0.291      1.000       1.000  0.551  0.636 (0.365, 0.907)
ADC  Robust MAD           0.481      1.000       1.000  0.688  0.591 (0.304, 0.878)
ADC  RMS                  0.159      1.000       1.000  0.440  0.682 (0.423, 0.941)
ADC  Skewness             0.439      1.000       1.000  0.688  0.600 (0.334, 0.867)
ADC  Total Energy         0.573      1.000       1.000  0.688  0.573 (0.298, 0.847)
ADC  Uniformity           0.481      1.000       1.000  0.688  0.591 (0.309, 0.873)
ADC  Variance             0.526      1.000       1.000  0.688  0.582 (0.302, 0.862)
FF   10 Percentile        0.025      0.891       0.718  0.111  0.806 (0.607, 1.000)
FF   90 Percentile        0.016      0.576       0.496  0.096  0.828 (0.628, 1.000)
FF   Energy               0.003      0.118       0.118  0.065  0.900 (0.760, 1.000)
FF   Entropy              0.624      1.000       1.000  0.702  0.433 (0.146, 0.720)
FF   Interquartile Range  0.253      1.000       1.000  0.550  0.344 (0.083, 0.606)
FF   Kurtosis             0.102      1.000       1.000  0.410  0.722 (0.481, 0.964)
FF   Maximum              0.142      1.000       1.000  0.440  0.700 (0.540, 0.860)
FF   Mean                 0.568      1.000       1.000  0.688  0.422 (0.133, 0.711)
FF   MAD                  0.009      0.323       0.296  0.065  0.856 (0.680, 1.000)
FF   Median               0.006      0.224       0.218  0.065  0.872 (0.707, 1.000)
FF   Minimum              0.153      1.000       1.000  0.440  0.694 (0.497, 0.892)
FF   Range                0.775      1.000       1.000  0.821  0.461 (0.194, 0.728)
FF   Robust MAD           0.191      1.000       1.000  0.492  0.678 (0.421, 0.935)
FF   RMS                  0.009      0.323       0.296  0.065  0.856 (0.680, 1.000)
FF   Skewness             0.022      0.801       0.667  0.111  0.811 (0.594, 1.000)
FF   Total Energy         0.007      0.254       0.240  0.065  0.867 (0.698, 1.000)
FF   Uniformity           0.935      1.000       1.000  0.962  0.511 (0.234, 0.789)
FF   Variance             0.624      1.000       1.000  0.702  0.433 (0.143, 0.724)
Table 4.7: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.495      1.000       1.000  0.974  0.400 (0.072, 0.728)
ADC  90 Percentile        0.974      1.000       1.000  0.974  0.495 (0.170, 0.821)
ADC  Energy               0.495      1.000       1.000  0.974  0.600 (0.233, 0.967)
ADC  Entropy              0.974      1.000       1.000  0.974  0.505 (0.190, 0.819)
ADC  Interquartile Range  0.255      1.000       1.000  0.974  0.667 (0.430, 0.903)
ADC  Kurtosis             0.626      1.000       1.000  0.974  0.571 (0.232, 0.911)
ADC  Maximum              0.537      1.000       1.000  0.974  0.590 (0.280, 0.901)
ADC  Mean                 0.922      1.000       1.000  0.974  0.486 (0.130, 0.842)
ADC  MAD                  0.205      1.000       1.000  0.974  0.686 (0.443, 0.929)
ADC  Median               0.820      1.000       1.000  0.974  0.533 (0.170, 0.897)
ADC  Minimum              0.845      1.000       1.000  0.974  0.471 (0.108, 0.949)
ADC  Range                0.416      1.000       1.000  0.974  0.619 (0.343, 0.895)
ADC  Robust MAD           0.229      1.000       1.000  0.974  0.676 (0.441, 0.912)
ADC  RMS                  0.871      1.000       1.000  0.974  0.524 (0.162, 0.886)
ADC  Skewness             0.380      1.000       1.000  0.974  0.629 (0.310, 0.948)
ADC  Total Energy         0.454      1.000       1.000  0.974  0.610 (0.235, 0.984)
ADC  Uniformity           0.720      1.000       1.000  0.974  0.448 (0.115, 0.780)
ADC  Variance             0.064      1.000       1.000  0.974  0.771 (0.585, 0.958)
FF   10 Percentile        0.737      1.000       1.000  0.974  0.550 (0.216, 0.884)
FF   90 Percentile        0.766      1.000       1.000  0.974  0.544 (0.120, 0.969)
FF   Energy               0.180      1.000       1.000  0.974  0.700 (0.337, 1.000)
FF   Entropy              0.456      1.000       1.000  0.974  0.611 (0.239, 0.984)
FF   Interquartile Range  0.205      1.000       1.000  0.974  0.689 (0.324, 1.000)
FF   Kurtosis             0.823      1.000       1.000  0.974  0.533 (0.117, 0.950)
FF   Maximum              0.766      1.000       1.000  0.974  0.456 (0.036, 0.875)
FF   Mean                 0.941      1.000       1.000  0.974  0.511 (0.120, 0.903)
FF   MAD                  0.157      1.000       1.000  0.974  0.711 (0.344, 1.000)
FF   Median               0.941      1.000       1.000  0.974  0.489 (0.107, 0.871)
FF   Minimum              0.794      1.000       1.000  0.974  0.539 (0.225, 0.853)
FF   Range                0.766      1.000       1.000  0.974  0.544 (0.102, 0.987)
FF   Robust MAD           0.118      1.000       1.000  0.974  0.733 (0.360, 1.000)
FF   RMS                  0.881      1.000       1.000  0.974  0.522 (0.124, 0.920)
FF   Skewness             0.502      1.000       1.000  0.974  0.600 (0.226, 0.974)
FF   Total Energy         0.205      1.000       1.000  0.974  0.689 (0.327, 1.000)
FF   Uniformity           0.823      1.000       1.000  0.974  0.533 (0.097, 0.970)
FF   Variance             0.297      1.000       1.000  0.974  0.656 (0.279, 1.000)
Table 4.8: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 2. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.236      1.000       1.000  0.467  0.638 (0.397, 0.878)
ADC  90 Percentile        0.061      1.000       1.000  0.184  0.725 (0.509, 0.941)
ADC  Energy               0.635      1.000       1.000  0.673  0.556 (0.321, 0.792)
ADC  Entropy              0.082      1.000       1.000  0.227  0.706 (0.479, 0.933)
ADC  Interquartile Range  0.029      1.000       0.890  0.154  0.759 (0.551, 0.975)
ADC  Kurtosis             0.635      1.000       1.000  0.673  0.556 (0.306, 0.806)
ADC  Maximum              0.051      1.000       1.000  0.167  0.731 (0.524, 0.939)
ADC  Mean                 0.562      1.000       1.000  0.673  0.569 (0.327, 0.811)
ADC  MAD                  0.011      0.411       0.411  0.154  0.800 (0.589, 1.000)
ADC  Median               0.958      1.000       1.000  0.958  0.506 (0.250, 0.762)
ADC  Minimum              0.461      1.000       1.000  0.638  0.413 (0.250, 0.575)
ADC  Range                0.040      1.000       1.000  0.159  0.744 (0.534, 0.953)
ADC  Robust MAD           0.018      0.637       0.584  0.154  0.781 (0.566, 0.996)
ADC  RMS                  0.171      1.000       1.000  0.384  0.663 (0.423, 0.902)
ADC  Skewness             0.140      1.000       1.000  0.336  0.675 (0.454, 0.896)
ADC  Total Energy         0.461      1.000       1.000  0.638  0.588 (0.351, 0.824)
ADC  Uniformity           0.246      1.000       1.000  0.467  0.638 (0.400, 0.875)
ADC  Variance             0.013      0.477       0.464  0.154  0.794 (0.588, 0.999)
FF   10 Percentile        0.014      0.515       0.486  0.154  0.804 (0.619, 0.989)
FF   90 Percentile        0.038      1.000       1.000  0.159  0.758 (0.548, 0.967)
FF   Energy               0.852      1.000       1.000  0.877  0.523 (0.274, 0.772)
FF   Entropy              0.457      1.000       1.000  0.638  0.592 (0.342, 0.843)
FF   Interquartile Range  0.556      1.000       1.000  0.673  0.573 (0.316, 0.830)
FF   Kurtosis             0.577      1.000       1.000  0.673  0.569 (0.300, 0.839)
FF   Maximum              0.215      1.000       1.000  0.455  0.654 (0.461, 0.847)
FF   Mean                 0.022      0.783       0.696  0.154  0.785 (0.584, 0.985)
FF   MAD                  0.264      1.000       1.000  0.476  0.638 (0.392, 0.885)
FF   Median               0.047      1.000       1.000  0.167  0.746 (0.534, 0.958)
FF   Minimum              0.598      1.000       1.000  0.673  0.565 (0.362, 0.769)
FF   Range                0.620      1.000       1.000  0.673  0.562 (0.324, 0.800)
FF   Robust MAD           0.321      1.000       1.000  0.525  0.623 (0.376, 0.870)
FF   RMS                  0.030      1.000       0.899  0.154  0.769 (0.563, 0.975)
FF   Skewness             0.107      1.000       1.000  0.275  0.700 (0.464, 0.936)
FF   Total Energy         0.577      1.000       1.000  0.673  0.569 (0.321, 0.817)
FF   Uniformity           0.321      1.000       1.000  0.525  0.623 (0.384, 0.862)
FF   Variance             0.385      1.000       1.000  0.603  0.608 (0.362, 0.853)
Table 4.9: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the delta data set in scenario 3. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                          p-values
Map  Attribute            MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
ADC  10 Percentile        0.324      1.000       1.000  0.556  0.627 (0.364, 0.891)
ADC  90 Percentile        0.035      1.000       1.000  0.139  0.773 (0.549, 0.997)
ADC  Energy               0.888      1.000       1.000  0.913  0.518 (0.252, 0.784)
ADC  Entropy              0.067      1.000       1.000  0.173  0.736 (0.501, 0.972)
ADC  Interquartile Range  0.049      1.000       1.000  0.159  0.755 (0.534, 0.976)
ADC  Kurtosis             0.481      1.000       1.000  0.619  0.591 (0.321, 0.861)
ADC  Maximum              0.067      1.000       1.000  0.173  0.736 (0.507, 0.966)
ADC  Mean                 0.029      1.000       0.871  0.139  0.782 (0.549, 1.000)
ADC  MAD                  0.439      1.000       1.000  0.619  0.600 (0.336, 0.864)
ADC  Median               0.778      1.000       1.000  0.849  0.536 (0.253, 0.819)
ADC  Minimum              0.439      1.000       1.000  0.619  0.400 (0.269, 0.531)
ADC  Range                0.067      1.000       1.000  0.173  0.736 (0.507, 0.966)
ADC  Robust MAD           0.035      1.000       1.000  0.139  0.773 (0.548, 0.997)
ADC  RMS                  0.105      1.000       1.000  0.253  0.709 (0.452, 0.967)
ADC  Skewness             0.205      1.000       1.000  0.369  0.664 (0.408, 0.919)
ADC  Total Energy         0.622      1.000       1.000  0.739  0.564 (0.300, 0.828)
ADC  Uniformity           0.139      1.000       1.000  0.313  0.691 (0.448, 0.934)
ADC  Variance             0.041      1.000       1.000  0.148  0.764 (0.538, 0.990)
FF   10 Percentile        0.003      0.105       0.105  0.053  0.919 (0.792, 1.000)
FF   90 Percentile        0.004      0.140       0.136  0.053  0.906 (0.762, 1.000)
FF   Energy               0.534      1.000       1.000  0.663  0.588 (0.293, 0.882)
FF   Entropy              0.183      1.000       1.000  0.346  0.688 (0.424, 0.951)
FF   Interquartile Range  0.824      1.000       1.000  0.873  0.531 (0.245, 0.818)
FF   Kurtosis             0.374      1.000       1.000  0.612  0.625 (0.333, 0.917)
FF   Maximum              0.155      1.000       1.000  0.329  0.300 (0.140, 0.460)
FF   Mean                 0.477      1.000       1.000  0.619  0.600 (0.319, 0.881)
FF   MAD                  0.004      0.161       0.152  0.053  0.900 (0.741, 1.000)
FF   Median               0.023      0.845       0.727  0.139  0.819 (0.619, 1.000)
FF   Minimum              0.450      1.000       1.000  0.619  0.394 (0.141, 0.646)
FF   Range                0.450      1.000       1.000  0.619  0.606 (0.348, 0.865)
FF   Robust MAD           0.657      1.000       1.000  0.739  0.563 (0.274, 0.851)
FF   RMS                  0.006      0.212       0.194  0.053  0.888 (0.720, 1.000)
FF   Skewness             0.021      0.752       0.668  0.139  0.825 (0.609, 1.000)
FF   Total Energy         0.929      1.000       1.000  0.929  0.513 (0.214, 0.811)
FF   Uniformity           0.183      1.000       1.000  0.346  0.688 (0.429, 0.946)
FF   Variance             0.657      1.000       1.000  0.739  0.563 (0.278, 0.847)
The AUC determined from a ROC curve is representative of how good an attribute may be. Nevertheless,
the same AUC can be achieved with differently shaped curves. The points (TPR, FPR) that build the
ROC curve of each attribute of interest need to undergo further examination in order to find the variables
that display the best balance between sensitivity and specificity. The general goal is to maximize
the TPR while minimizing the FPR, which corresponds to a point close to the upper left corner of
the graph.
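As an illustration of this idea, the sketch below builds the ROC points of a single attribute by sweeping a threshold, computes the AUC through its rank interpretation, and selects the point closest to the upper left corner. The closest-to-corner criterion is an assumption made for this example only and is not necessarily the selection method of Subsection 3.3.3; the function names, scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples; a case is called positive if score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def auc_rank(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def closest_to_corner(points):
    """Pick the ROC point with the smallest squared distance to (FPR, TPR) = (0, 1)."""
    return min(points, key=lambda t: t[1] ** 2 + (1 - t[2]) ** 2)


# Hypothetical attribute values and class labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 0]

print(auc_rank(scores, labels))
print(closest_to_corner(roc_points(scores, labels)))
```

Two attributes with the same AUC can yield quite different best points under such a criterion, which is why the individual (TPR, FPR) pairs are inspected in addition to the AUC.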
Table 4.10 lists the attributes considered to have the best balance between TPR and FPR for
the problem at hand (obtained by the method briefly described in Subsection 3.3.3), the threshold at which
these rates occur, and other performance metrics associated with each attribute
(accuracy, precision and F1-measure).
It is important to mention that only attributes that remained statistically significant after the multiple
comparison corrections are presented in this section. The imaging features found interesting when
considering the ROC curve alone are displayed in Appendix D.1 along with their associated performance
metrics.
The attributes depicted in Table 4.10 will be referred to as key attributes from this point on. The ROC
curve associated with each key attribute presented in Table 4.10 is available in Appendix D.2. The
second part of Table 4.10 contains the AUC values, with the respective 95% confidence intervals, and
the p-values before and after the multiple comparison corrections for each key attribute.
Table 4.10: Summary of the attributes considered statistically relevant through the p-value evaluation and the ROC curve, with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.

Data set      Scenario  Attribute      TPR    FPR    Threshold    Accuracy  Precision  F1
ADC PreTreat  1         Kurtosis       0.667  0.087  3.30         0.862     0.667      0.667
FF PreTreat   1         90 Percentile  0.667  0.091  19.0         0.857     0.667      0.667
FF PreTreat   1         Median         0.833  0.182  11.0         0.821     0.556      0.667
FF PreTreat   1         RMS            0.833  0.182  17.4         0.821     0.556      0.667
FF PreTreat   1         Skewness       0.833  0.182  0.624        0.821     0.556      0.667
FF PreTreat   1         Total Energy   0.833  0.182  2.72 × 10^7  0.821     0.556      0.667

                                                             p-values
Data set      Scenario  Attribute      AUC (95% CI)          MW U-Test  Bonferroni  Holm   BH
ADC PreTreat  1         Kurtosis       0.855 (0.679, 1.000)  0.008      0.300       0.283  0.042
FF PreTreat   1         90 Percentile  0.879 (0.747, 1.000)  0.005      0.184       0.184  0.042
FF PreTreat   1         Median         0.856 (0.698, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         RMS            0.856 (0.704, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         Skewness       0.856 (0.702, 1.000)  0.009      0.306       0.283  0.042
FF PreTreat   1         Total Energy   0.864 (0.703, 1.000)  0.007      0.259       0.252  0.042
One may observe that in Table 4.10 there are two attributes with the same TPR (0.667) but distinct
FPR. The attribute kurtosis from the ADC map displays a slightly lower FPR than the
attribute 90 percentile from the FF map (a difference of -0.004). Nonetheless, the attribute 90 percentile
displays one of the best AUC values of all the attributes evaluated, particularly when considering the
confidence interval. With these aspects considered, the author decided to further explore these two key
attributes.
All of the remaining 8 attributes that survived the Benjamini-Hochberg correction presented a
threshold at which the TPR is equal to 0.833 and the FPR is equal to 0.182. The criteria that led to
the selection of the 4 attributes depicted in Table 4.10 with this TPR and FPR were the AUC value and
the p-value after the Mann-Whitney U-Test. The attributes selected have an AUC over 0.850, with a
confidence interval ranging from approximately 0.70 to 1.00, and an associated p-value under 0.01.
Through Table 4.10 it is possible to see that the increase of the TPR is associated with the increase
in the FPR.
Together, the p-value and the ROC curve allow one to infer which attributes and which
data sets could be more robust in distinguishing the two classes (responders vs non-responders).
Since all the key attributes belong to scenario 1, the initial main goal would be to minimize the FPR. In
other words, it would be to avoid the misclassification of a responsive patient as a non-responder, as this
mistake could lead to the interruption of a working treatment. The class non-responders is composed of
6 patients in both data sets, while the class responders is composed of 22 or 23 patients, depending on
whether one is considering the FF or the ADC data set, respectively. The TPRs of 0.833 and 0.667 correspond
to the misclassification of 1 and 2 patients, respectively, that belong to the non-responders class, while the
FPRs of 0.087 or 0.091 and of 0.182 correspond to the misclassification of 2 and 4 patients, respectively,
that belong to the responders population. What one should consider is whether misclassifying one
more patient as a responder is better than misclassifying two more patients as non-responders. In the
first case, one more patient may undergo unnecessary debilitating procedures, whereas in the second case
two more patients would have their treatment adjusted, which could lead to a step back in the fight against
the disease. The other performance metrics (accuracy, precision and F1-measure) show better results in
the first case (TPR equal to 0.667).
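The patient counts above are enough to recover the metrics of Table 4.10. The sketch below does so for the FF data set (6 non-responders, 22 responders), assuming, as the text implies, that non-responders are treated as the positive class; the function name is hypothetical and the confusion-matrix counts are derived from the misclassification numbers stated above.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion-matrix counts, rounded to three decimals."""
    return {
        "TPR": round(tp / (tp + fn), 3),
        "FPR": round(fp / (fp + tn), 3),
        "accuracy": round((tp + tn) / (tp + fn + fp + tn), 3),
        "precision": round(tp / (tp + fp), 3),
        "F1": round(2 * tp / (2 * tp + fp + fn), 3),
    }


# FF data set, first case: 2 of 6 non-responders missed, 2 of 22 responders misclassified.
print(classification_metrics(tp=4, fn=2, fp=2, tn=20))
# {'TPR': 0.667, 'FPR': 0.091, 'accuracy': 0.857, 'precision': 0.667, 'F1': 0.667}

# FF data set, second case: 1 of 6 non-responders missed, 4 of 22 responders misclassified.
print(classification_metrics(tp=5, fn=1, fp=4, tn=18))
# {'TPR': 0.833, 'FPR': 0.182, 'accuracy': 0.821, 'precision': 0.556, 'F1': 0.667}
```

Both cases give the same F1-measure because the gain in TPR in the second case is offset by the loss in precision.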
4.1.2 Detailed Analysis of the Key Attributes
The patients are divided between responders and non-responders. The evaluation of class separation
considering the median of each population is reflected in the p-values attained with the
Mann-Whitney U-Test. For the key attributes, Table 4.11 explicitly shows the mean and standard deviation
of each group, as well as the range of values taken by each variable.
It is expected, considering the preceding analysis in Subsection 4.1.1, that these attributes
display a good separation between the two populations. Table 4.11 gives
an idea of the distance between the means of each population, as well as a notion of class dispersion
when the standard deviation and the range are considered.
A well-known method for graphically depicting the distribution of groups of numerical data through their
quartiles is the box and whisker plot. It facilitates the visualization of the separation (location) of the
class values, and it may also enable the evaluation, in each individual population, of the existence of
outliers and of data dispersion, for instance. The box and whisker plots for the key attributes are depicted
in Figure 4.1.

Table 4.11: Mean associated with the key attributes for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each key attribute is located.

                             Mean ± SD
Data set      Key attribute  Responders                 Non-Responders             Range
ADC PreTreat  Kurtosis       2.82 ± 0.34                3.50 ± 0.62                (2.12, 4.56)
FF PreTreat   90 Percentile  38.3 ± 11.4                19.8 ± 12.3                (5.0, 47.0)
FF PreTreat   Median         27.1 ± 13.7                6.8 ± 11.0                 (0, 41)
FF PreTreat   RMS            28.7 ± 10.7                17.7 ± 8.9                 (4.12, 40.43)
FF PreTreat   Skewness       −0.22 ± 1.07               1.60 ± 1.21                (−1.22, 3.47)
FF PreTreat   Total Energy   1.40 × 10^8 ± 9.43 × 10^7  3.30 × 10^7 ± 5.34 × 10^7  (1.49 × 10^6, 2.49 × 10^8)
The box represents the observations between the lower quartile (Q1, 25th percentile) and the upper
quartile (Q3, 75th percentile); the distance between them is called the interquartile range
(IQR = Q3 − Q1). The median (Q2, 50th percentile) is traced inside the box. The whiskers are defined
as the interval of variability of the observations. The upper extreme can be defined as the minimum
between Q3 + 1.5 · IQR and the maximum value of the class observations, and the lower extreme as the
maximum between Q1 − 1.5 · IQR and the minimum value of the class observations. Observations outside
the whiskers are considered outliers.
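As an illustration of these definitions, the quantities drawn in a box and whisker plot can be computed as in the following minimal Python sketch. The function name `box_whisker_summary`, the example values and the quartile convention (`method="inclusive"`) are assumptions made here for illustration only; they are not taken from the tools used in this work, and other quartile conventions give slightly different Q1 and Q3.

```python
import statistics


def box_whisker_summary(values):
    """Return quartiles, IQR, whisker extremes and outliers for one class."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; the thesis does not state the convention used.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Extremes exactly as defined in the text above.
    upper = min(q3 + 1.5 * iqr, data[-1])
    lower = max(q1 - 1.5 * iqr, data[0])
    outliers = [v for v in data if v < lower or v > upper]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper, "outliers": outliers}


# Made-up observations, not patient data.
example = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_whisker_summary(example)
```

For these illustrative values, Q1 = 3, Q3 = 7 and IQR = 4, so the upper extreme is 13 and the observation 100 is flagged as an outlier. Some plotting libraries draw the whisker at the most extreme observation inside this limit rather than at the limit itself.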
All the values for the median, upper and lower quartiles and upper and lower extremes in each case
are presented in Table D.2 (Appendix D.3).
It is important to keep in mind the discrepancy between the number of patients in each class in
order to have a better understanding of the results depicted in each box and whisker plot. Since all
the key attributes belong to scenario 1, the class responders corresponds to approximately 80% of the
patients present in this study, while the remaining 20% belong to the class non-responders, as mentioned
previously. This fact can be particularly relevant when analysing class dispersion.
As previously mentioned, the box and whisker plot is a graphic representation of the data distribution.
As a general appreciation, one may say that all the box and whisker plots appear to be in accordance
with the results previously found, which indicate that these attributes have discriminatory power when
distinguishing the two populations (responders and non-responders).
Although there is no attribute that shows a perfect separation between classes, one might observe
that the key attributes seem to display a good separation. All the key attributes in scenario 1 relating
to the FF map at the pre-treatment stage show an apparent absence of common values in the IQR.
Through the box and whisker plot (Figure 4.1), it appears that the classes have distinct distributions
and locations when considering the key attribute 90 percentile (FF PreTreat 1). The distribution of
the non-responders class is broader than that of the responders class. The signal intensities of an FF
map range from 0 to 50. There is an apparent tendency toward lower values of this attribute in the
non-responders class and toward higher values in the responders class.

Figure 4.1: Comparison of differences in signal intensity parameters collected from ADC and FF maps between responders and non-responders for the key attributes: A Kurtosis in the ADC pre-treatment data set in scenario 1 (ADC PreTreat 1), B 90 Percentile in the FF pre-treatment data set in scenario 1 (FF PreTreat 1), C Median in the FF pre-treatment data set in scenario 1, D Skewness in the FF pre-treatment data set in scenario 1, E Root Mean Squared in the FF pre-treatment data set in scenario 1 and F Total Energy in the FF pre-treatment data set in scenario 1. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations, and the lower extreme as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.
Through the box and whisker plots (Figure 4.1), it appears that the classes have distinct distributions
and locations when considering individually the key attributes median, root mean squared and total
energy (FF PreTreat 1). The classes in these three graphics seem to display a similar relative position,
with the non-responders group tending to lower values, while the responders group appears to tend to
higher values, similarly to what happens with the 90 percentile attribute. In addition, the
distribution in the responders group appears to be broader than in the non-responders group. There is
an easily identifiable outlier in the non-responders group in all three plots, and it belongs to the
same patient (patient 18, who was classified as having stable disease).
Through the box and whisker plot (Figure 4.1), it appears that the classes have distinct locations when
considering the key attribute skewness (FF PreTreat 1). The skewness of the non-responders class
comprises mostly positive values, which indicates a positively skewed distribution, especially if the
values are above 0.5. The skewness of the responders class comprises mostly negative values, which
indicates a negatively skewed distribution, especially if the values are below −0.5.
4.1.3 Clinical Variables
The univariate analysis developed in this work has as its main focus the preliminary discovery of an
accurate imaging biomarker for the prediction of treatment response status in an early phase of
induction chemotherapy. Nevertheless, the available clinical data from the patients in this study
might also be a source of possible biomarkers.
Identically to the procedure followed when analysing the imaging features, the clinical data set was
submitted to the Mann-Whitney U-Test and to the multiple comparison corrections, as well as to the AUC
evaluation in all scenarios (1, 2 and 3). The results are depicted in Tables 4.12 to 4.14.
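The three multiple comparison corrections can be expressed as adjusted p-values, as sketched below. This is a minimal pure-Python sketch of the standard Bonferroni, Holm and Benjamini-Hochberg (BH) definitions, not the RapidMiner or RStudio code used in this work; the function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down Holm adjustment: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up BH adjustment: the k-th smallest p-value is multiplied by m / k."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order):
        rank = m - k  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four attributes (not taken from the tables).
raw = [0.01, 0.04, 0.03, 0.005]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
```

An attribute is kept as significant when its adjusted p-value stays below 0.05. In this made-up example all four attributes survive BH, three survive Holm and only two survive Bonferroni, which illustrates why BH is the least conservative of the three.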
Table 4.12: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 1. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                        -------------- p-values --------------
Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.351      1.000       1.000  0.576  0.625 (0.301, 0.949)
Weight                  0.014      0.317       0.290  0.091  0.830 (0.668, 0.992)
Height                  0.074      1.000       1.000  0.212  0.740 (0.515, 0.964)
Gender                  0.120      1.000       1.000  0.306  0.708 (0.516, 0.900)
ECOGPS                  0.736      1.000       1.000  0.770  0.545 (0.286, 0.804)
Hb                      0.016      0.366       0.318  0.091  0.823 (0.630, 1.000)
WBC                     0.337      1.000       1.000  0.576  0.628 (0.390, 0.867)
PMNs                    0.517      1.000       1.000  0.646  0.587 (0.299, 0.875)
Platelet                0.378      1.000       1.000  0.580  0.618 (0.280, 0.956)
Albumin                 0.452      1.000       1.000  0.633  0.601 (0.309, 0.893)
Creatinine              0.641      1.000       1.000  0.702  0.563 (0.277, 0.849)
Urea                    0.534      1.000       1.000  0.646  0.583 (0.352, 0.815)
Calcium                 0.641      1.000       1.000  0.702  0.563 (0.263, 0.862)
Alkaline Phosphatase    0.468      1.000       1.000  0.633  0.597 (0.291, 0.903)
β2M                     0.023      0.518       0.428  0.104  0.806 (0.579, 1.000)
LDH                     0.005      0.118       0.113  0.059  0.906 (0.718, 1.000)
Actual Percentage       0.058      1.000       0.993  0.192  0.777 (0.587, 0.967)
Serum Peak              0.325      1.000       1.000  0.576  0.632 (0.337, 0.927)
Serum Immunoglobulin A  0.001      0.030       0.030  0.030  0.452 (0.200, 0.705)
Serum Kappa Free        0.325      1.000       1.000  0.509  0.569 (0.275, 0.863)
Serum Lambda Free       0.325      1.000       1.000  1.000  0.623 (0.328, 0.918)
Ratio κλ                0.325      1.000       1.000  0.428  0.602 (0.306, 0.899)
ISS                     0.038      0.876       0.686  0.146  0.778 (0.612, 0.944)
Table 4.13: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 2. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                        -------------- p-values --------------
Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.525      1.000       1.000  0.837  0.431 (0.208, 0.653)
Weight                  0.044      1.000       1.000  0.653  0.720 (0.5334, 0.9064)
Height                  0.472      1.000       1.000  0.837  0.421 (0.200, 0.642)
Gender                  0.525      1.000       1.000  0.837  0.569 (0.382, 0.757)
ECOGPS                  0.933      1.000       1.000  0.933  0.491 (0.287, 0.694)
Hb                      0.498      1.000       1.000  0.837  0.574 (0.353, 0.795)
WBC                     0.933      1.000       1.000  0.933  0.509 (0.284, 0.735)
PMNs                    0.719      1.000       1.000  0.933  0.539 (0.323, 0.756)
Platelet                0.320      1.000       1.000  0.837  0.609 (0.389, 0.828)
Albumin                 0.525      1.000       1.000  0.837  0.569 (0.344, 0.795)
Creatinine              0.446      1.000       1.000  0.837  0.583 (0.363, 0.804)
Urea                    0.421      1.000       1.000  0.837  0.588 (0.373, 0.803)
Calcium                 0.882      1.000       1.000  0.933  0.484 (0.251, 0.716)
Alkaline Phosphatase    0.849      1.000       1.000  0.933  0.521 (0.303, 0.739)
β2M                     0.446      1.000       1.000  0.837  0.583 (0.370, 0.797)
LDH                     0.899      1.000       1.000  0.933  0.444 (0.222, 0.667)
Actual Percentage       0.899      1.000       1.000  0.933  0.550 (0.306, 0.794)
Serum Peak              0.568      1.000       1.000  0.837  0.563 (0.343, 0.782)
Serum Immunoglobulin A  0.525      1.000       1.000  0.837  0.587 (0.340, 0.834)
Serum Kappa Free        0.352      1.000       1.000  0.837  0.646 (0.440, 0.853)
Serum Lambda Free       0.057      1.000       1.000  0.653  0.672 (0.469, 0.875)
Ratio κλ                0.409      1.000       1.000  0.837  0.671 (0.467, 0.875)
ISS                     0.582      1.000       1.000  0.837  0.560 (0.354, 0.766)

In the clinical data, 6 attributes show statistical significance in scenario 1 (weight, Hb, β2M, LDH,
serum immunoglobulin A and ISS), 1 attribute is considered statistically significant in scenario 2
(weight) and 1 attribute displays statistical relevance in scenario 3 (serum lambda free) after the
Mann-Whitney U-Test. Only the attribute serum immunoglobulin A remains statistically significant when
submitted to the multiple comparison corrections. Some attributes display an AUC over 0.70: 7
attributes in scenario 1 (weight, height, gender, Hb, β2M, LDH and ISS), 1 in scenario 2 (weight) and
3 in scenario 3 (serum kappa free, serum lambda free and ratio κλ). It is worth mentioning that the
only attribute that survives the multiple comparison corrections does not have an AUC over 0.70; in
fact, serum immunoglobulin A displays a low AUC value of 0.452.
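For reference, the AUC of a single attribute can be read as the proportion of (positive, negative) patient pairs in which the positive patient has the higher value, with ties counted as one half; the number of such "wins" is the Mann-Whitney U statistic of the positive class. The sketch below is a minimal illustration of that relation, assuming the non-responders are the positive class; the function name and the values are made up and are not patient data.

```python
def auc_from_values(positive_values, negative_values):
    """AUC as U / (n_pos * n_neg), where U counts pairwise wins of the positive class."""
    wins = 0.0
    for p in positive_values:
        for n in negative_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_values) * len(negative_values))


# Made-up attribute values for two small groups.
non_responders = [224, 250, 180, 270]
responders = [150, 160, 140, 190, 130]

auc = auc_from_values(non_responders, responders)
auc_reversed = auc_from_values(responders, non_responders)
```

An AUC below 0.5 under one labelling corresponds to an AUC above 0.5 when the direction of the attribute is reversed, since the two values add up to 1.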
From the attributes in the clinical data set, all of those with an AUC over 0.70 were evaluated in
order to find the ones that display the best balance between TPR and FPR. Similarly to what was done
for the imaging features, Table 4.15 summarizes the p-values and AUC values, as well as all the
statistical parameters of interest, such as the true positive rate, false positive rate, threshold,
accuracy, precision and F1-measure. In this particular case, the attribute LDH displayed a better
performance than any other attribute.
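The performance metrics listed above all follow from the four cells of a confusion matrix. The sketch below is illustrative only: the function name is hypothetical, the non-responders are assumed to be the positive class, and the counts (5 true positives, 1 false negative, 0 false positives, 23 true negatives) are back-calculated here from the reported LDH rates and the class sizes of 6 non-responders and 23 responders, not taken from the original analysis files.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute TPR, FPR, accuracy, precision and F1 from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # sensitivity / recall
    fpr = fp / (fp + tn)                      # 1 - specificity
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {"tpr": tpr, "fpr": fpr, "accuracy": accuracy,
            "precision": precision, "f1": f1}


# Assumed counts for LDH in scenario 1 (see the lead-in above).
ldh = classification_metrics(tp=5, fn=1, fp=0, tn=23)
ldh_rounded = {name: round(value, 3) for name, value in ldh.items()}
```

Under these assumed counts the rounded values are TPR 0.833, FPR 0, accuracy 0.966, precision 1.000 and F1 0.909.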
It is worth mentioning that, although the attribute LDH displays very good performance metrics, it
does not survive the multiple comparison corrections; its statistical significance after the
Mann-Whitney U-Test may therefore be a chance finding.
Table 4.14: p-values achieved from the Mann-Whitney U-Test for each attribute of interest before and after the implementation of the multiple comparison corrections and the AUC value obtained using the ROC curve, with the respective 95% confidence interval, regarding each attribute of the clinical data set in scenario 3. These results were achieved using RapidMiner® and RStudio®. The p-values under the 0.05 significance level are stressed in the color blue. The AUC values over 0.70 are stressed in the color orange.

                        -------------- p-values --------------
Attribute               MW U-Test  Bonferroni  Holm   BH     AUC (95% CI)
Age at Rx               0.840      1.000       1.000  0.997  0.476 (0.228, 0.724)
Weight                  0.260      1.000       1.000  0.997  0.635 (0.403, 0.868)
Height                  0.885      1.000       1.000  0.997  0.483 (0.237, 0.728)
Gender                  1.000      1.000       1.000  1.000  0.500 (0.294, 0.706)
ECOGPS                  0.954      1.000       1.000  0.997  0.507 (0.284, 0.730)
Hb                      0.729      1.000       1.000  0.997  0.542 (0.293, 0.790)
WBC                     0.908      1.000       1.000  0.997  0.514 (0.266, 0.762)
PMNs                    0.470      1.000       1.000  0.997  0.587 (0.348, 0.825)
Platelet                0.141      1.000       1.000  0.810  0.677 (0.446, 0.909)
Albumin                 0.665      1.000       1.000  0.997  0.552 (0.307, 0.798)
Creatinine              0.488      1.000       1.000  0.997  0.583 (0.340, 0.826)
Urea                    0.273      1.000       1.000  0.997  0.632 (0.385, 0.879)
Calcium                 0.840      1.000       1.000  0.997  0.476 (0.226, 0.726)
Alkaline Phosphatase    0.840      1.000       1.000  0.997  0.524 (0.276, 0.772)
β2M                     0.817      1.000       1.000  0.997  0.528 (0.276, 0.779)
LDH                     0.931      1.000       1.000  0.997  0.610 (0.357, 0.863)
Actual Percentage       0.544      1.000       1.000  0.997  0.454 (0.193, 0.716)
Serum Peak              0.840      1.000       1.000  0.997  0.524 (0.277, 0.772)
Serum Immunoglobulin A  0.908      1.000       1.000  0.997  0.582 (0.326, 0.837)
Serum Kappa Free        0.119      1.000       1.000  0.810  0.705 (0.481, 0.928)
Serum Lambda Free       0.006      0.128       0.128  0.128  0.765 (0.566, 0.965)
Ratio κλ                0.126      1.000       1.000  0.810  0.764 (0.556, 0.973)
ISS                     0.686      1.000       1.000  0.997  0.652 (0.437, 0.867)

For both attributes, LDH and serum immunoglobulin A, Table 4.16 presents the mean and standard
deviation for each class, as well as the range of values taken by each variable, and Figure 4.2
displays the box and whisker plots that depict the separation of both populations. All the values for
the median, upper and lower quartiles and upper and lower extremes in each case are presented in
Table D.2 (Appendix D.3).
Both the data presented in Table 4.16 and the box and whisker plot presented in Figure 4.2A seem
to support a good class division for the attribute LDH. The LDH observations range between 100 and
277. In the box and whisker plot, both populations present distinct distributions and locations.
The non-responders class appears to comprise overall higher values and a more dispersed collection
of observations when compared to the responders class. One may say that there is a significant
separation between classes. Nevertheless, it is important to keep in mind that the non-responders
class consists of 6 patients, while the responders class is composed of 23 patients. The class
imbalance may contribute to data artifacts.
Table 4.15: Summary of the attributes considered statistically relevant through the Mann-Whitney U-Test and the ROC curve with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy, precision, F1-measure (F1), AUC score with the respective 95% confidence interval and p-values before and after the multiple comparison tests.

Data set  Scenario  Attribute  TPR    FPR  Threshold  Accuracy  Precision  F1
Clinical  1         LDH        0.833  0    195.5      0.966     1.000      0.909

                                                 -------------- p-value ---------------
Data set  Scenario  Attribute  AUC (95% CI)      MW U-Test  Bonferroni  Holm   BH
Clinical  1         LDH        0.906 (0.718-1)   0.005      0.118       0.113  0.059

Table 4.16: Mean associated with the clinical attributes LDH and serum immunoglobulin A (Serum IgA) for each class (responders and non-responders), with the respective standard deviation (SD), and the range of values within which each attribute is located.

Attribute  Mean ± SD (Responders)  Mean ± SD (Non-Responders)  Range
LDH        151 ± 23                224 ± 45                    100 − 277
Serum IgA  34 ± 8                  226 ± 670                   23 − 3180

Figure 4.2: Box and whisker plot for the clinical attributes A LDH and B Serum Immunoglobulin A (Serum IgA) in scenario 1. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations, and the lower extreme as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.

As can be seen in the box and whisker plot for the attribute serum immunoglobulin A (Figure 4.2B),
the responders class contains a considerable number of outliers. In order to better understand the
information present in the box and whisker plot, the graphic was amplified (Figure 4.3).
No good class separation is visible for serum immunoglobulin A in the amplified box and whisker plot
(Figure 4.3). The good results obtained in the p-value evaluation may be influenced by the presence of
outliers in the responders class (3 outliers), by the lack of data for 5 patients and by the class
imbalance that characterizes the data set in this scenario (only 16% of the patients are
non-responders, 4 out of 25).
Figure 4.3: Amplified box and whisker plot for the clinical attribute Serum Immunoglobulin A (Serum IgA) in scenario 1. The boundaries of the box show the 25th (Q1) and 75th (Q3) percentiles, and the line within the box is the median. The whiskers are defined as the interval of variability of the observations. The upper extreme can be defined as the minimum between Q3 + 1.5 · IQR and the maximum value of the class observations, and the lower extreme as the maximum between Q1 − 1.5 · IQR and the minimum value of the class observations.
4.2 Density Plots Comparative Analysis Over the First Round of
Chemotherapy
The classification of the response to treatment among patients ranges from 1 to 6, with distinct levels
of response. One may think that these patients might show different patterns in the data collected
according to their response to treatment. The hypothesis is that a patient with complete response to
treatment or progressive disease would show a greater change in the appearance of their spinal
vertebrae on the MRI images than a patient with partial response or stable disease. In other words,
a larger variation in the signal intensity extracted from the ADC and FF maps would be registered
in cases with a more significant change in the condition of the spine.
The probability density function specifies for a continuous random variable the distribution of the
density over the range of values taken by the variable. This function can be graphically depicted and it
constitutes an accurate representation of the distribution of the numerical data.
The density plots displayed in Figures 4.4 through 4.9 depict the overlay of the signal intensity distri-
bution collected from the ADC and FF maps before and after the first round of the chemotherapy from 3
different patients (2, 6 and 22), all with a different classification in the response to treatment according
to the IMWG guidelines (6, 4 and 1, respectively).
The characteristics that describe the distribution of the signal intensities measured directly from the
parametric maps in the two moments of image acquisition are summarized in Table 4.17.
Table 4.17: Summary of the statistical metrics that characterize the density plots depicted in Figures 4.4 to 4.9, namely: mean ± standard deviation (SD), kurtosis (Kurt), skewness (Skew), median, minimum (Min) and maximum (Max); with the respective patient identification, the map and phase of treatment to which the evaluated signal intensities correspond, the classification of the response to treatment and the identification of the Figure (Fig) of the respective density plot.

Patient ID  Map  Treatment phase  Mean ± SD    Kurt  Skew    Median  Min  Max   Response to Treatment  Fig
2           ADC  Pre              295 ± 86     9.27  1.65    281     0    962   6                      4.4
2           ADC  Post             340 ± 114    5.44  1.29    316     0    953   6                      4.4
2           FF   Pre              8.25 ± 9.63  6.35  1.83    6.00    0    50.0  6                      4.5
2           FF   Post             3.69 ± 6.19  17.8  3.43    2.00    0    50.0  6                      4.5
6           ADC  Pre              189 ± 149    5.15  1.24    168     0    975   4                      4.6
6           ADC  Post             147 ± 162    6.96  1.93    108     0    886   4                      4.6
6           FF   Pre              38.0 ± 8.2   6.45  −1.62   40.0    0    50.0  4                      4.7
6           FF   Post             37.3 ± 8.3   5.99  −1.46   39.0    0    50.0  4                      4.7
22          ADC  Pre              682 ± 347    2.66  0.198   664     0    1999  1                      4.8
22          ADC  Post             684 ± 402    2.46  0.379   634     0    2030  1                      4.8
22          FF   Pre              23.0 ± 14.1  1.97  −0.242  25.0    0    50.0  1                      4.9
22          FF   Post             27.1 ± 14.2  2.22  −0.584  30.0    0    50.0  1                      4.9

From Table 4.17 it is possible to observe that the kurtosis for patients 2 and 6 is always higher
than 3 (5.15 to 17.8), while the kurtosis for patient 22 is always below 3 (1.97 to 2.66). A kurtosis
equal to 3 corresponds to a normal distribution of the data, as mentioned previously in Section 2.4.
These results indicate that the signal distributions from patients 2 and 6 present longer and heavier
tails than a normal distribution, which indicates that the data are more scattered; the kurtosis value
increases with the number of outliers. On the other hand, the signals collected from patient 22
display a distribution with shorter and thinner tails than a normal distribution.
The skewness is always greater than 1 in absolute value in patients 2 and 6, which reflects a highly
skewed distribution. In patient 2 the skewness is positive in both ADC and FF maps, while for patient 6
the skewness is positive for the ADC map and negative for the FF map. A positive or negative skewness
reflects the concentration of the signal intensity values to the left or to the right, respectively.
The skewness of the data collected from patient 22 is always below 1 in absolute value, and in the
majority of the cases below 0.5, which indicates a moderate to non-significant asymmetry in the
distribution of the signal intensity data collected.
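The two shape descriptors discussed above can be computed from the central moments of the signal intensities. The following is a minimal sketch, assuming population (biased) moments and the non-excess definition of kurtosis, under which a normal distribution has a kurtosis of 3; the function name and the example values are hypothetical, and the feature extraction software used in this work may apply a different estimator.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) from population central moments.

    Kurtosis is the non-excess form: m4 / m2**2, equal to 3 for a normal distribution.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape descriptors are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Made-up intensities: a symmetric sample and a sample with one high outlier.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 0, 10]

skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
skew_tail, kurt_tail = skewness_and_kurtosis(right_tailed)
```

In the symmetric sample the skewness is 0 and the kurtosis is below 3, whereas the single high value in the second sample produces a positive skewness above 1 and a kurtosis above 3, the same pattern described for patients 22 and 2, respectively.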
The range of the signal intensity values from the ADC and FF maps is considerably different. The
FF maps have intensities ranging from 0 to 50 in all cases, while the values in the ADC maps lie
between 0 and approximately 1000 for patients 2 and 6 and between 0 and approximately 2000 for
patient 22. Considering only the FF maps, patients 2 and 22 present a similar signal intensity
evolution in absolute value from the pre-treatment to the post-1st-cycle moment: both mean and median
vary by 4 to 5 units (8% to 10% of the intensity range), which may be considered relevant given the
range of intensities in these maps (0 to 50). The alterations for patient 6 in the mean and median
from the pre-treatment to the post-1st-cycle moment are less evident (between 0.7 and 1 units, which
corresponds to a variation of at most 2% of the same range). Considering only the ADC maps, patients 2
and 6 have a similar evolution from the pre-treatment to the post-1st-cycle moment: in absolute value,
both the mean and median vary by 35 to 60 units (3.5% to 6.0% of the intensity range), which might be
considered relevant given the range of intensities (0 to 1000). The alterations for patient 22 in the
mean and median from the pre-treatment to the post-1st-cycle moment are much less evident (between 2
and 30 units, which corresponds to a variation of at most 1.5%) given the signal intensity range
(0 to 2000).
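The percentages quoted in this comparison express the absolute change of a statistic as a fraction of the intensity range of the map. A minimal sketch of that arithmetic is given below, using values from Table 4.17 and the approximate ranges stated in the text; the function name is hypothetical.

```python
def change_as_percent_of_range(pre, post, range_width):
    """Absolute change between two acquisitions, as a percentage of the map's range."""
    return abs(post - pre) / range_width * 100.0


# Patient 2, FF map, mean: 8.25 -> 3.69 on a 0 to 50 range.
ff_mean_p2 = change_as_percent_of_range(8.25, 3.69, 50)
# Patient 6, ADC map, median: 168 -> 108 on an approximate 0 to 1000 range.
adc_median_p6 = change_as_percent_of_range(168, 108, 1000)
# Patient 22, ADC map, mean: 682 -> 684 on an approximate 0 to 2000 range.
adc_mean_p22 = change_as_percent_of_range(682, 684, 2000)
```

Normalizing by the range makes the FF and ADC changes comparable despite their different scales, although the ADC ranges used here are only approximate.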
Finally, from the observation of the graphics alone, it appears that patients 2 and 22, classified as
having progressive disease and stringent complete response, respectively, show more distinct curves
when comparing the pre-treatment and the post-1st-cycle signal intensities collected.
Figure 4.4: Density plot of the signal intensities extracted from the ADC maps of patient 2, who presents a response classification of 6 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
Figure 4.5: Density plot of the signal intensities extracted from the FF maps of patient 2, who presents a response classification of 6 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
Figure 4.6: Density plot of the signal intensities extracted from the ADC maps of patient 6, who presents a response classification of 4 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
Figure 4.7: Density plot of the signal intensities extracted from the FF maps of patient 6, who presents a response classification of 4 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
Figure 4.8: Density plot of the signal intensities extracted from the ADC maps of patient 22, who presents a response classification of 1 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
Figure 4.9: Density plot of the signal intensities extracted from the FF maps of patient 22, who presents a response classification of 1 to the induction therapy. Legend: blue - pre-treatment and orange - post-1st-cycle.
5 Discussion
MRI is being increasingly used for the initial evaluation of patients with MM, as well as for monitoring the
disease’s progression. This non-invasive method may be used either qualitatively or quantitatively with
promising results. [13] [14]
In this study, the primary goal was to investigate the possibility of using first-order statistics features
extracted from ADC and FF maps as imaging biomarkers in the early prediction of treatment response
in multiple myeloma.
There are several reports that attest to the reproducibility of the measurement of signal FF [14] [63]
and the repeatability of ADC measurements [12]. The differentiation between responders and
non-responders in MM treatment is a common practice. [12] [13] [14] [63]
As one may observe from this study, of the 35 (out of 216) imaging features found statistically
significant after the Mann-Whitney U-Test, only 10 survived the Benjamini-Hochberg procedure and none
survived the more conservative approaches. When evaluating the clinical data set, 8 (out of 69)
clinical variables were found statistically significant after the Mann-Whitney U-Test and only
1 attribute survived the multiple comparison corrections. The multiple comparison corrections are
applied to avoid overoptimistic results: given the large number of variables evaluated, some
attributes may be found statistically significant only by chance.
Interestingly, all 11 attributes that survived the BH correction belong to scenario 1. This is the
only scenario where patients with partial response are classified as responders. Both scenarios 2 and 3
consider the patients with partial response to treatment as belonging to the non-responders group,
although scenario 3 excludes from the analysis patients who are classified as having stable or
progressive disease.
In particular, all the imaging features that remain statistically significant after the BH correction
at a significance level of 0.05 belong to the pre-treatment data set. Considering this particular data
set, one may say that scenario 1 displays the best overall results when compared with scenario 2. In
this comparison, best results means a higher number of attributes found statistically significant
and with an AUC value above 0.70. This observation could be an indication that patients with partial
response show a signal intensity distribution in the parametric map before the induction treatment
closer to the one displayed by patients with very good partial or complete response. After all, when
the patients with partial response are transferred to the non-responders group, the overall
discriminatory power of the attributes decreases. Still considering the pre-treatment data set, when
comparing scenarios 2 and 3, in other words, when the patients with stable or progressive disease are
removed from the analysis and the patients with very good partial or complete response are compared
only with the patients with partial response, the results improve, particularly those associated with
the attributes extracted from the FF map. This improvement might be an indication that the patients
with partial response can be distinguished from the patients with a complete or very good partial
response using data acquired before the beginning of treatment, but that this distinction is more
plausible in the absence of patients who do not respond to the induction therapy. Any distinction that
could be made a priori regarding the patient's expected final response would be a valuable asset in
terms of patient care, since it could potentially spare the patient unnecessary debilitating
procedures.
In addition, scenarios 2 and 3 show similar results, both better than those of scenario 1, when
considering the delta data set. This situation may indicate that the variation in signal intensity in
the patients with complete or very good partial response to treatment is distinct from the variation
observed in the remaining patients. This observation may support the hypothesis that patients with
partial response could be distinguished from patients with complete or very good partial response to
treatment, in this particular case after a single round of induction chemotherapy. This kind of
differentiation among responders might allow treatment adjustments that would, ideally, optimize the
treatment route for each patient, thus also leading to improved patient care.
A fourth scenario that would be interesting to explore in a larger data set is the division of the
patients into 3 classes: patients with complete or very good partial response (treatment's response
classification: 1, 2 and 3), patients with partial response (treatment's response classification: 4) and
patients with no positive response (treatment's response classification: 5 and 6).
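The class definitions of the three scenarios and of this proposed three-class division can be written as a simple mapping from the response classification (1 to 6) to a label. The sketch below is reconstructed from the descriptions in this chapter and is an assumption rather than the code used in this work; the function names and label strings are hypothetical.

```python
RESPONDER = "responder"
NON_RESPONDER = "non-responder"


def label_for_scenario(response_code, scenario):
    """Map a response classification (1-6) to a class label; None means excluded."""
    if response_code not in range(1, 7):
        raise ValueError("response classification must be between 1 and 6")
    if scenario == 1:
        # Partial response (4) or better counts as responder.
        return RESPONDER if response_code <= 4 else NON_RESPONDER
    if scenario == 2:
        # Partial response is moved to the non-responders group.
        return RESPONDER if response_code <= 3 else NON_RESPONDER
    if scenario == 3:
        # Stable or progressive disease (5 and 6) is excluded from the analysis.
        if response_code >= 5:
            return None
        return RESPONDER if response_code <= 3 else NON_RESPONDER
    raise ValueError("scenario must be 1, 2 or 3")


def three_class_label(response_code):
    """Proposed fourth scenario: three classes instead of two."""
    if response_code not in range(1, 7):
        raise ValueError("response classification must be between 1 and 6")
    if response_code <= 3:
        return "complete or very good partial response"
    if response_code == 4:
        return "partial response"
    return "no positive response"
```

For example, a patient with partial response (classification 4) is a responder in scenario 1, a non-responder in scenarios 2 and 3, and forms a class of its own in the proposed fourth scenario.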
There are six attributes from the MR images which are considered to best predict the final treatment
response considering the p-value evaluation and the ROC curve analysis. All these key attributes survive
the Benjamini-Hochberg correction at a significance level of 0.05 and are associated with an AUC value
superior to 0.850. These attributes are: kurtosis from the pre-treatment ADC map in scenario 1 [P =
0.042, AUC = 0.855 (0.679 − 1.00)] and 90 percentile [P = 0.042, AUC = 0.879 (0.747 − 1)], median
[P = 0.042, AUC = 0.856 (0.698 − 1)], root mean squared [P = 0.042, AUC = 0.856 (0.740 − 1)],
skewness [P = 0.042, AUC = 0.856 (0.702 − 1)] and total energy [P = 0.042, AUC = 0.864 (0.703 − 1)]
from the pre-treatment FF map in scenario 1.
From the more detailed analysis using the box and whisker plots, one may say that the key attributes
from the pre-treatment FF map appear to display some of the best separations between classes when
compared with the remaining key attribute (kurtosis from the ADC map). This result underlines the
possibility of using imaging features extracted before the beginning of treatment as imaging
biomarkers, with special focus on the FF maps, with sensitivity above 66% and specificity above 81%.
If any of these biomarkers were validated as reliable in a larger cohort of patients, the result would
be very promising for patient care. As mentioned above, all the key attributes belong to scenario 1;
as a consequence, these key imaging features could discriminate, before the treatment has started,
between patients who achieve at least a partial response and patients who are non-responsive to
treatment. Thus, these biomarkers could prevent patients from undergoing unnecessary debilitating
procedures.
It was mentioned previously that the data sets displayed unbalanced class proportions and that, as a
consequence, the accuracy performance metric could give misleading results. For this reason, other
performance metrics were estimated and, as can be seen in Table 4.10, most of the attributes in
scenario 1 display a higher accuracy than precision or F1-measure, with a difference ranging between
19.5 and 26.5 percentage points. This difference may reflect an optimistic estimation of the accuracy
due to the considerably lower amount of positively labeled data (approximately 20% positive
observations).
Appendix D.1 displays the attributes that were considered good based on the AUC analysis alone but
that did not survive the multiple comparison corrections, which led to their exclusion as possible key
attributes. These attributes are found across the three scenarios evaluated. These scenarios have
differently balanced data sets, scenario 1 being the most unbalanced and scenario 3 the most balanced,
with an almost 50/50 division of patients among classes. One may observe that the performance metrics
accuracy, precision and F1-measure get successively closer in value from scenario 1 to scenario 3.
This result supports the initial premise that the performance of an attribute should be evaluated
with more than one performance metric.
Giles et al. (2014) analyzed the evolution of the ADC signal intensity in patients with multiple
myeloma. They suggest that the mean signal intensity from the ADC map increased in responders but not
in non-responders over the course of three cycles of chemotherapy. [12] They consider patients with
partial response as responders, which corresponds to scenario 1 in this work. The overall increase or
stability in the ADC signal intensity in both classes is also reported by Latifoltojar et al. (2016
and 2017). [13] [14]
Latifoltojar et al. (2017) also analyzed the FF signal intensity of focal lesions in patients with
multiple myeloma, using whole-body MRI. The classes responders and non-responders were defined as in
scenario 1. In the responders, they found significant changes in the mean signal intensity over the
course of two cycles of chemotherapy, whereas in the non-responders no significant alterations were
found. These authors report stronger alterations in the signal FF, which they consider to be a better
biomarker than ADC. [14] This conclusion is supported by a previous study by the same authors, in
which significant increases in signal FF after 8 weeks of treatment showed the potential of early
signal FF changes for the prognosis of MM. [13] Both studies focus on focal lesions, as these are
recognized to be more relevant to disease pathogenesis and risk assignment than diffuse marrow signal
abnormalities. [64]
There are two main differences between this study and the ones cited above. First, in this work only
some vertebrae of the spine were analyzed, whereas Giles et al. (2014) and Latifoltojar et al. (2016
and 2017) performed whole-body DWI. Second, the post-1st cycle data in this work were collected after
a single round of chemotherapy, whereas the treatment period analyzed in those studies was longer.
The studies cited [12] [14] aim to study the overall signal intensity changes as potential biomarkers
for MM, focusing on the mean. In this thesis, several first-order imaging features were evaluated
individually through a univariate analysis. The comparison that can be made between studies is
therefore limited. In this work, the attribute mean was not found to be statistically significant after
the multiple comparison corrections in any of the analyses of the evolution of signal intensity
(delta). However, in the pre-treatment data set for the FF map in scenario 1, this first-order
statistic is statistically significant after the multiple comparison corrections at the significance
level of 0.05, with an AUC value of 0.841. Finally, there are several patterns associated with MM (such
as the existence of focal lesions or the predominance of diffuse infiltration), which may affect water
diffusion and, consequently, the intensity of the signals acquired. [62]
In addition to the main goal of this study, possible clinical biomarkers were evaluated, and the
pre-treatment and post-1st cycle signal intensities extracted from the ADC and FF maps were compared
for patients with different responses to treatment.
Among the clinical variables studied, one is of particular interest: serum lactate dehydrogenase
(LDH), which shows a very good AUC value [AUC = 0.906 (0.718 − 1)]. This attribute is associated with
a sensitivity of 83.3% and a specificity of 100% at a threshold of 195.5. Nevertheless, this variable
does not survive the multiple comparison corrections, which may indicate that LDH was found
statistically significant in the Mann-Whitney U-Test (P = 0.005) only by chance. Several references
support LDH as a valuable attribute in the prediction of disease progression in untreated MM
patients. [65] [66] [67] High LDH levels are usually associated with disease proliferation and
aggressiveness and are therefore correlated with a negative outcome regarding response to
treatment. [68] This result indicates that the combination of clinical variables and imaging features
may be worth exploring in future work with a larger cohort of patients.
According to the density plots comparing the two moments analyzed in this study (before and after the
first round of chemotherapy), in most cases there appears to be a visible variation in the signal
intensity registered in the ADC and FF maps, which is also reflected in the variation of the
associated statistical metrics.
The lesions that occur in multiple myeloma are associated with high cellularity and high water
content. In the density plots of the FF signal for the different patients, there seems to be a good
relation between the variation in signal intensity and the evolution of the disease. The shift of the
signal along the x-axis is consistent with the accumulation of water at the lesion sites in MM
patients. The increase in water molecules as the lesions evolve should lead to a decrease in the fat
fraction at these sites. Therefore, an increase in the overall signal intensity is expected in patients
who respond to treatment (the case of patient 22) and a decrease in the overall signal intensity is
expected with progressive disease (the case of patient 2).
In addition, the density plots suggest that the variations in ADC signal intensity for patient 2 (an
example of a patient with progressive disease) and for patient 22 (an example of a patient with
stringent complete response) are consistent with the observations of Messiou et al. (2012), who state
that ADC values are higher in marrow with active myeloma than in marrow in remission. [69]
The fat fraction maps might be considered a more reliable source of information regarding disease
progression. The quantitative parameter FF results from a simple estimation of the fat fraction within
the bone marrow. The temporal changes in the ADC, on the other hand, depend on several variables, such
as perfusion, cellularity, fat fraction and water content. The balance among these characteristics may
lead to an unpredictable result in an ADC map. For instance, the increase in cellularity associated
with multiple myeloma would lead to a decrease of the ADC, whereas the increase in perfusion, also
associated with the progression of multiple myeloma, would lead to an increase of the ADC. These
conflicting variations may lead to a poor description of a patient's disease progression, especially
when considering inter-patient variability. [14] [70]
There are some limitations to this study, one of the biggest being the small number of participating
patients. A reduced number of patients is a common problem in this type of study because of the
deteriorated physical condition of the patients. The treatment they undergo is very aggressive and
leaves them fragile and unwilling to participate in additional tests for research purposes. Another
limitation is the arbitrary choice of the moment at which the second round of image acquisition was
performed. Although there are reports that document changes in signal intensity, specifically
regarding ADC, as early as one month after the beginning of treatment [69], the moment chosen for the
second image acquisition may not be optimal to distinguish responders from non-responders. A third
limitation of this study is the segmentation step, since it is a source of variability in the data
obtained from the MR images.
6 Conclusions
The univariate analysis identified attributes with potential to discriminate between the classes
responders and non-responders, with good sensitivity and specificity. Five attributes are of special
interest, since they may be extracted from the FF maps before the induction treatment: the 90th
percentile, with a sensitivity of 66.7% and a specificity of 90.9%, and the median, root mean squared,
skewness and total energy, with a sensitivity of 83.3% and a specificity of 81.8%. These results are
particularly interesting, since they indicate that the final response to induction therapy could be
predicted before the treatment starts, which would improve patient care by avoiding unnecessary
debilitating procedures.
In addition, the comparison among the three scenarios approached in this study indicates that a
distinction between patients with a complete or very good partial response and patients with a partial
response may be possible. This kind of differentiation among responders might allow treatment
adjustments that would, ideally, optimize the treatment route for each patient, thus also improving
patient care.
Ideally, there would be a substantial increase in the number of patients available, as this condition is
key to allow the development of a multivariate model.
The univariate analysis is used as a proof of concept, in the sense that it shows the individual
potential of each attribute to predict the final response to treatment of MM. Nevertheless, a more
complex and probably more accurate approach could be achieved with a multivariate analysis.
In a multivariate analysis with a larger cohort of patients, different attributes could be combined,
which is not possible in this study since it would lead to an overfitted classifier. A first possible
step could be to combine the features found statistically significant in the univariate analysis,
which includes the combination of imaging features from both the ADC and FF maps. In addition, one may
consider adding clinical biomarkers, with particular interest in the ones that can be determined by
non-invasive techniques.
The first part of the thesis consisted of the extensive creation of segmentation masks for all four
types of MRI images collected (T1-weighted, STIR, in- and opposed-phase gradient echo, and DWI),
covering all visible vertebrae. Besides the 30 patients who were eligible for this study, the MRI
images of another 97 patients were segmented. This work will be used as ground truth for a deep
learning algorithm for automatic segmentation that is being developed by the master's student Jose
Maria Moreira at the Champalimaud Foundation.
Bibliography
[1] N. C. Institute. What is cancer?, February 2015 (accessed March 2019).
https://www.cancer.gov/about-cancer/understanding/what-is-cancer.
[2] M. Roser and H. Ritchie. Cancer. Our World in Data, 2019 (accessed March 2019).
https://ourworldindata.org/cancer.
[3] W. H. Organization. The top 10 causes of death, May 2018 (accessed March 2019).
https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death.
[4] T. A. C. S. medical and editorial content team. What is multiple myeloma? American Cancer So-
ciety, 2018 (accessed March 2019). https://www.cancer.org/cancer/multiple-myeloma/about/what-
is-multiple-myeloma.html.
[5] C. C. Society. The plasma cells, (Accessed March 2019). http://www.cancer.ca/en/cancer-
information/cancer-type/multiple-myeloma/multiple-myeloma/the-plasma-cells/?region=on.
[6] S. B. A. Baur-Melnyk, H. R. Durr, and M. Reiser. Role of MRI for the diagnosis and prognosis of
multiple myeloma. European journal of radiology, 2005. doi:10.1016/j.ejrad.2005.01.017.
[7] H.-O. voor Volwassenen Nederland (HOVON). M-protein in multiple myeloma, 2004 (accessed
March 2019). http://www.hovon.nl/upload/File/Studies AlgStudInfo HovonRichtlijnenDocs/M-
protein%20in%20MM v02jun04.pdf.
[8] J. C. Dutoit and K. L. Verstraete. MRI in multiple myeloma: a pictorial review of diagnostic and
post-treatment findings. Insights Imaging, 2016. doi:10.1007/s13244-016-0492-7.
[9] Chemocare. Melphalan, 2002-2019 (accessed March 2019).
http://chemocare.com/chemotherapy/drug-info/Melphalan.aspx.
[10] G. N. Le, J. Bones, M. Coyne, D. Bazou, P. Dowling, P. O’Gorman, and A.-M. Larkin. Current and
future biomarkers for risk-stratification and treatment personalisation in multiple myeloma. Royal
Society of Chemistry, 2018. doi:10.1039/c8mo00193f.
[11] H. J. W. L. Aerts. The potential of radiomic-based phenotyping in precision medicine: a review.
JAMA Oncology, 2016. doi:10.1001/jamaoncol.2016.263.
[12] S. L. Giles, C. Messiou, D. J. Collins, V. A. Morgan, C. J. Simpkin, and et al. Whole-body
diffusion-weighted MR imaging for assessment of treatment response in myeloma. Radiology, 2014.
[13] A. Latifoltojar, M. Hall-Craggs, N. Rabin, R. Popat, A. Bainbridge, and et al. Whole body mag-
netic resonance imaging in newly diagnosed multiple myeloma: early changes in lesional signal fat
fraction predict disease response. British Journal of Haematology, 2016. doi:10.1111/bjh.14401.
[14] A. Latifoltojar, M. Hall-Craggs, A. Bainbridge, N. Rabin, R. Popat, and et al. Whole-body MRI
quantitative biomarkers are associated significantly with treatment response in patients with newly
diagnosed symptomatic multiple myeloma following bortezomib induction. Eur. Radiol., 2017.
doi:10.1007/s00330-017-4907-8.
[15] D. Shah, K. Seiter, F. Talavera, and E. C. Besa. Multiple myeloma guidelines. Medscape, 2018
(accessed April 2019). https://emedicine.medscape.com/article/204369-guidelinesg1%20.
[16] P. Moreau, J. S. Miguel, P. Sonneveld, M. V. Mateos, E. Zamagni, and et al. Multiple myeloma
guidelines. ESMO, 2017. doi:10.1093/annonc/mdx096.
[17] R. J. Gillies, P. E. Kinahan, and H. Hricak. Radiomics: Images are more than pictures, they are
data. Radiology, 2016.
[18] H. Collins, S. Calvo, K. Greenberg, L. F. Neall, and S. Morrison. Information needs in the precision
medicine era: How genetics home reference can help. Interactive journal of medical research,
2016. doi:10.2196/ijmr.5199.
[19] V. Kumar, Y. Gu, S. Basu, A. Berglund, S. A. Eschrich, and et al. Radiomics: the process and the
challenges. Magnetic Resonance Imaging, 2012.
[20] L. E. Court, X. Fave, D. Mackin, J. Lee, J. Yang, and L. Zhang. Computational resources for
radiomics. Translational Cancer Research, 2016. doi:10.21037/tcr.2016.06.17.
[21] D. C. Preston. Magnetic resonance imaging (mri) of the brain and spine: Basics, 2006 (accessed
April 2019). http://casemed.case.edu/clerkships/neurology/web%20neurorad/mri%20basics.htm.
[22] D. J. Bell, J. Jones, and et al. Larmor frequency. Radiopaedia, 2005-2019 (accessed April 2019).
https://radiopaedia.org/articles/larmor-frequency.
[23] E. K. Outwater, R. Blasbalg, E. S. Siegelman, and M. Vala. Detection of lipid in abdominal
tissues with opposed-phase gradient-echo images at 1.5 T: Techniques and diagnostic importance.
RadioGraphics, 1998. 18:1465-1480.
[24] M. A. Bernstein, K. F. King, and X. J. Zhou. Handbook of MRI Pulse Sequences. Elsevier Academic
Press, 2004. ISBN:978-0120-92861-3.
[25] D. J. Bell, J. Jones, and et al. T1 weighted images. Radiopaedia, 2005-2019 (accessed April 2019).
https://radiopaedia.org/articles/t1-weighted-image.
[26] A. Murphy, J. Jones, and et al. T2 weighted images. Radiopaedia, 2005-2019 (accessed April 2019).
https://radiopaedia.org/articles/t2-weighted-image.
[27] R. Sharma, Mohammad, and et al. Short tau inversion recovery. Radiopaedia, 2005-2019 (accessed
April 2019). https://radiopaedia.org/articles/short-tau-inversion-recovery.
[28] E. Placidi. Magnetic resonance imaging of colonic function. PhD thesis, University of Nottingham,
2011.
[29] A. S. Shetty, A. L. Sipe, M. Zulfiqar, R. Tsai, D. A. Raptis, and et al. In-phase and opposed-phase
imaging: Applications of chemical shift and magnetic susceptibility in the chest and abdomen.
RadioGraphics, 2018 (accessed April 2019). https://doi.org/10.1148/rg.2019180043.
[30] H. J. Shin, H. G. Kim, M.-J. Kim, H. Koh, H. Y. Kim, and et al. Normal range of hepatic fat fraction
on dual- and triple-echo fat quantification MR in children. PLoS ONE, 2015. 10(2):e0117480.
[31] E. O. Stejskal and J. E. Tanner. Spin diffusion measurements: Spin echoes in the presence of a
time-dependent field gradient. The Journal of Chemical Physics, 1965. 42(1):288-292.
[32] J. H. Burdette, D. D. Durden, A. D. Elster, and Y. F. Yen. High b-value diffusion-weighted MRI of
normal brain. J Comput Assist Tomogr, 2001. 25:515-519.
[33] P. B. Kingsley and W. G. Monahan. Selection of the optimum b factor for diffusion-weighted mag-
netic resonance imaging assessment of ischemic stroke. Mag Reson Med, 2004. 51:996-1001.
[34] R. Channel. Difusion weighted imaging - radiology video tutorial (mri), 2015 (accessed April 2019).
https://www.youtube.com/watch?v=YHxi-Juf G0.
[35] J. J. M. van Griethuysen, A. Fedorov, C. Parmar, A. Hosny, N. Aucoin, and et al. Computa-
tional radiomics system to decode the radiographic phenotype, 2017. Cancer Research, 77(21),
e104–e107. https://doi.org/10.1158/0008-5472.CAN-17-0339.
[36] M. G. Bulmer. Principles of Statistics. Dover Books of Mathematics, 1979.
[37] P. H. Westfall. Kurtosis as peakedness, 1905 – 2014. r.i.p. The American statistician, 2014.
68(3):191-195.
[38] M. M. Oken, R. H. Creech, D. C. Tormey, J. Horton, T. E. Davis, and et al. Toxicity and response
criteria of the eastern cooperative oncology group. American Journal of Clinical Oncology, 1982.
5(6), 649-656.
[39] WebMD. Webster's New World™ Medical Dictionary. Wiley Publishing, Inc., Hoboken, New
Jersey, third edition, 2008. ISBN: 978-0-470-18928-3.
[40] Creatinine and creatinine clearance blood tests. WebMD, 2005-2019 (accessed October 2019).
https://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-blood-tests1.
[41] A. Dasgupta and A. Wahed. Clinical chemistry, immunology and laboratory quality control. Science
Direct, 2014. https://doi.org/10.1016/B978-0-12-407821-5.00013-9.
[42] J. E. Masterson and S. D. Schwartz. The enzymatic reaction catalyzed by lactate dehydrogenase
exhibits one dominant reaction path. Chemical physics, 2014. 442(17):132-136.
[43] M. Fraser and C. Haldeman-Englert. Health Encyclopedia, Lactic Acid Dehydrogenase (Blood).
University of Rochester Medical Center, Rochester, New York, 2019 (accessed October 2019).
https://www.urmc.rochester.edu/encyclopedia/content.aspx?contenttypeid=167&contentid=lactic
acid dehydrogenase blood.
[44] M. L. Vekaria, B. Rao, and P. Kuriakose. Significance of bone marrow plasma cell percentage
in patients with monoclonal gammopathy of unknown significance developing multiple myeloma.
Blood, 2014. 124(21):5688.
[45] R. K. Loh, S. Vale, and A. McLean-Tooke. Quantitative serum immunoglobulin tests. Australian
family physician, 2013. 42(4):195-8.
[46] J. A. Katzmann, R. J. Clark, R. S. Abraham, S. Bryant, J. F. Lymp, and et al. Serum reference
intervals and diagnostic ranges for free kappa and free lambda immunoglobulin light chains: relative
sensitivity for detection of monoclonal light chains. Clinical Chemistry., 2002. 48(9):1437-44.
[47] I. M. Foundation. International staging system (iss) and revised iss (r-iss).
https://www.myeloma.org/multiple-myeloma/staging-risk-stratification/international-staging-
system-iss-reivised-iss-r-iss (accessed October 2019).
[48] P. R. Greipp, J. S. Miguel, B. G. Durie, J. J. Crowley, B. Barlogie, and et al. International staging
system for multiple myeloma. Journal of Clinical Oncology., 2005. 23(15):3412-20.
[49] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, chapter 1. Pearson Education,
2006. ISBN 0-321-42052-7.
[50] T. Dejoie, J. Corre, H. Caillon, P. Moreau, M. Attal, and H. A. Loiseau. Response in multi-
ple myeloma should be assigned according to serum, not urine, free light chain measurements.
Leukemia, 2019. 33:313–318.
[51] B. G. Tabachnick and L. S. Fidell. Using Multivariate Statistics, chapter 1. Pearson, sixth edition,
2006. ISBN 978-0-205-89081-1.
[52] R. Ho. Handbook of Univariate and Multivariate Data Analysis and Interpretation with SPSS. Chap-
man Hall/CRC, 2006. ISBN 978-1-584-88602-0.
[53] M. C. Morais. Notas de apoio da disciplina de Probabilidade e Estatística. Instituto Superior
Técnico, 2012.
[54] C. M. R. Kitchen. Nonparametric versus parametric tests of location in biomedical research.
American Journal of Ophthalmology, 2009. 147(4):571-572.
[55] B. S. Everitt. The Cambridge Dictionary of Statistics. Cambridge University Press, 2002.
[56] D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and
Hall/CRC, second edition, 2000.
[57] J. H. McDonald. Handbook of Biological Statistics, pages 254–260. Sparky House Publishing, third
edition, 2014. http://www.biostathandbook.com/multiplecomparisons.html.
[58] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics,
Vol. 6, No. 2, pp. 65-70, 1979.
[59] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological),
Vol. 57, No. 1, pp. 289-300, 1995.
[60] N. Seliya, T. M. Khoshgoftaar, and J. Van Hulse. A study on the relationships of classifier performance
metrics. IEEE Computer Society, 2009. doi:10.1109/ICTAI.2009.25.
[61] T. Fawcett. An introduction to roc analysis. Elsevier, 2005. doi:10.1016/j.patrec.2005.10.010.
[62] V. Koutoulidis, S. Fontara, E. Terpos, F. Zagouri, D. Matsaridis, and et al. Quantitative
diffusion-weighted imaging of the bone marrow: An adjunct tool for the diagnosis of a diffuse MR
imaging pattern in patients with multiple myeloma. Radiology, 2017. 282(2).
[63] M. Maas, E. M. Akkerman, H. W. Venema, J. Stoker, and G. J. D. Heeten. Dixon quantitative
chemical shift MRI for bone marrow evaluation in the lumbar spine: A reproducibility study in healthy
volunteers. Journal of Computer Assisted Tomography, 2001. 25(5):691–697.
[64] S. V. Rajkumar, M. A. Dimopoulos, A. Palumbo, J. Blade, G. Merlini, and et al. International
myeloma working group updated criteria for the diagnosis of multiple myeloma. The Lancet. Oncol-
ogy, 2014. doi:10.1016/S1470-2045(14)70442-5.
[65] H. Uskudar Teke, M. Basak, M. Kanbay, and et al. Serum level of lactate dehydrogenase is a useful
clinical marker to monitor progressive multiple myeloma diseases: A case report. Turkish Journal
of Hematology, 2014. 31(1):84-87.
[66] M. A. Dimopoulos, B. Barlogie, T. L. Smith, and R. Alexanian. High serum lactate dehydrogenase
level as a marker for drug resistance and short survival in multiple myeloma. Ann. Intern. Med.,
1991. 115(12):931-5.
[67] E. Terpos, E. Katodritou, M. Roussou, A. Pouli, E. Michalis, and et al. High serum lactate dehydro-
genase adds prognostic value to the international myeloma staging system even in the era of novel
agents. Eur. Journal Haematology, 2010. doi: 10.1111/j.1600-0609.2010.01466.x.
[68] N. Hatakeyama, M. Daibata, Y. Nemoto, Y. Ohtsuki, and H. Taguchi. Lactate dehydrogenase
production and release in a newly established human myeloma cell line. American Journal of
Hematology, 2001. 66:267–273.
[69] C. Messiou, S. Giles, and et al. Assessing response of myeloma bone disease with diffusion-
weighted MRI. The British Journal of Radiology, 2012. 85(1020):e1198-e1203.
[70] J. Hillengass, T. Bauerle, R. Bartl, M. Andrulis, F. McClanahan, and et al. Diffusion-weighted imag-
ing for non-invasive and quantitative monitoring of bone marrow infiltration in patients with mono-
clonal plasma cell disease: a comparative study with histology. British Journal of Haematology,
2011. doi:10.1111/j.1365-2141.2011.08658.x.
[71] S. V. Rajkumar, J. L. Harousseau, B. Durie, K. C. Anderson, M. Dimopoulos, and et al. Consensus
recommendations for the uniform reporting of clinical trials: report of the international myeloma
workshop consensus panel 1. Blood., 2011. 117:4691-5.
[72] X. Robin, N. Turck, A. Hainard, N. Tiberti, J.-C. Sanchez, and et al. Display and analyze roc curves.
Technical report, 2019. Package ’pROC’, https://cran.r-project.org/web/packages/pROC/pROC.pdf.
A Additional Information
A.1 IMWG Guidelines
The IMWG uniform response criteria were derived from the criteria published by the European Group for
Blood and Bone Marrow Transplant, the International Bone Marrow Transplant Registry and the American
Bone Marrow Transplant Registry, commonly referred to as the Blade criteria or the European Group for
Blood and Bone Marrow Transplant criteria. [71]
Table A.1: International Myeloma Working Group uniform response criteria by response subcategory for multiple myeloma. [50]
Response: Criteria
Progressive disease: Requires any one or more of the following:
• Increase of 25 % from lowest response value in serum M-protein (absolute increase must be ≥ 0.5 g/dL), and/or urine M-protein (absolute increase must be ≥ 200 mg/24 h).
• Only in patients without measurable serum and urine M-protein levels: the difference between involved and uninvolved free light chain (FLC) levels (absolute increase must be > 10 mg/dL).
• Only in patients without measurable serum and urine M-protein levels and without measurable disease by FLC levels, bone marrow plasma cells (PCs) percentage (absolute percentage must be ≥ 10 %).
• Definite development of new bone lesions or soft tissue plasmacytomas or definite increase in the size of existing bone lesions or soft tissue plasmacytomas.
• Development of hypercalcemia (corrected serum calcium > 11.5 mg/dL) that can be attributed solely to the PC proliferative disorder.
Immunophenotypic CR: Stringent CR plus absence of phenotypically aberrant PCs (clonal) in BM with a minimum of 1 million total BM cells analyzed by multiparametric flow cytometry (with > 4 colors).
Molecular CR: CR plus negative ASO-PCR, sensitivity 10^-5.
Stringent complete response: CR as defined, plus
• Normal FLC ratio, and
• Absence of clonal PCs by immunohistochemistry or 2- to 4-color flow cytometry.
Complete response:
• Negative immunofixation of serum and urine, disappearance of any soft tissue plasmacytomas, and < 5 % PCs in bone marrow.
Very good partial response:
• Serum and urine M-protein detectable by immunofixation but not on electrophoresis, or
• ≥ 90 % reduction in serum M-protein plus urine M-protein < 100 mg/24 h.
Partial response:
• ≥ 50 % reduction of serum M-protein and reduction in 24-hour urinary M-protein by ≥ 90 % or to < 200 mg/24 h.
• If the serum and urine M-protein are not measurable, a decrease ≥ 50 % in the difference between involved and uninvolved FLC levels is required in place of the M-protein criteria.
• If serum and urine M-protein are not measurable, and serum free light assay is also not measurable, ≥ 50 % reduction in bone marrow PCs is required in place of M-protein, provided baseline percentage was ≥ 30 %.
• In addition to the above criteria, if present at baseline, ≥ 50 % reduction in the size of soft tissue plasmacytomas is also required.
Stable disease: Not meeting criteria for CR, VGPR, PR, or PD.
B Data Sets
Table B.1: Response to the induction therapy on a numeric scale: 1, stringent complete response; 2, complete response; 3, very good partial response; 4, partial response; 5, stable disease; and 6, progressive disease. The columns named scenario 1, scenario 2 and scenario 3 explicitly state the class in which each patient is placed, responder or non-responder. In scenario 3, the patients excluded from the study receive the designation excluded.
ID Response to Induction Therapy Scenario 1 Scenario 2 Scenario 3
1 3 responder responder responder
2 6 non-responder non-responder excluded
3 5 non-responder non-responder excluded
4 4 responder non-responder non-responder
5 3 responder responder responder
6 4 responder non-responder non-responder
7 4 responder non-responder non-responder
8 4 responder non-responder non-responder
9 3 responder responder responder
10 4 responder non-responder non-responder
11 4 responder non-responder non-responder
12 3 responder responder responder
13 4 responder non-responder non-responder
14 5 non-responder non-responder excluded
15 5 non-responder non-responder excluded
16 3 responder responder responder
17 3 responder responder responder
18 5 non-responder non-responder excluded
19 4 responder non-responder non-responder
20 3 responder responder responder
21 4 responder non-responder non-responder
22 1 responder responder responder
23 3 responder responder responder
24 3 responder responder responder
25 4 responder non-responder non-responder
26 4 responder non-responder non-responder
27 3 responder responder responder
28 3 responder responder responder
29 6 non-responder non-responder excluded
30 4 responder non-responder non-responder
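The class assignment in Table B.1 can be expressed as a simple rule on the numeric response code. The R sketch below is inferred from the rows of the table rather than taken from the thesis code, and the helper name assign_scenarios is hypothetical.

```r
# Response codes follow the caption of Table B.1:
# 1 = stringent complete response ... 6 = progressive disease.
# `assign_scenarios` is a hypothetical helper, inferred from the table rows.
assign_scenarios <- function(response) {
  data.frame(
    response   = response,
    # Scenario 1: at least a partial response counts as responder
    scenario_1 = ifelse(response <= 4, "responder", "non-responder"),
    # Scenario 2: at least a very good partial response counts as responder
    scenario_2 = ifelse(response <= 3, "responder", "non-responder"),
    # Scenario 3: as scenario 2, but stable and progressive disease are excluded
    scenario_3 = ifelse(response >= 5, "excluded",
                        ifelse(response <= 3, "responder", "non-responder")),
    stringsAsFactors = FALSE
  )
}

# Patients 1, 2, 4 and 22 in Table B.1 have response codes 3, 6, 4 and 1
example <- assign_scenarios(c(3, 6, 4, 1))
print(example)
```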
C Materials and Methods
C.1 R code
C.1.1 Adjusted p-values
The function that returns the adjusted p-values was written in R and relies on the function p.adjust.
Different multiple comparison corrections can be specified: here, the Bonferroni correction
("bonferroni"), the Holm method ("holm") and the BH procedure ("BH") are considered. In the order
presented, these corrections are progressively less conservative. The adjusted p-values are returned
rounded to 6 decimal places.
    rm_main = function(data) {
      # Calculation of the adjusted p-value using the Bonferroni correction
      data$pvalue_bonferroni <- round(p.adjust(data$p_value, "bonferroni"), 6)
      # Calculation of the adjusted p-value using the Holm method
      data$pvalue_holm <- round(p.adjust(data$p_value, "holm"), 6)
      # Calculation of the adjusted p-value using the BH procedure
      data$pvalue_BH <- round(p.adjust(data$p_value, "BH"), 6)
      return(data)
    }
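As a usage illustration, the same three corrections can be applied to a small vector of raw p-values. The vector p_raw below is hypothetical and does not contain results from this work; the sketch only shows how the three corrections compare.

```r
# Hypothetical raw p-values (assumed for illustration; not results from the thesis)
p_raw <- c(0.001, 0.008, 0.020, 0.041, 0.300)

# Adjusted p-values for the three corrections, rounded to 6 decimal places
adjusted <- data.frame(
  p_value    = p_raw,
  bonferroni = round(p.adjust(p_raw, method = "bonferroni"), 6),
  holm       = round(p.adjust(p_raw, method = "holm"), 6),
  BH         = round(p.adjust(p_raw, method = "BH"), 6)
)
print(adjusted)

# For every test, the adjusted value does not increase from Bonferroni to Holm to BH
ordered <- all(adjusted$bonferroni >= adjusted$holm) &&
  all(adjusted$holm >= adjusted$BH)
print(ordered)
```

In this made-up example the fourth p-value (0.041) is below 0.05 before correction but above 0.05 after each of the three corrections, which is the situation described as not surviving the multiple comparison corrections.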
C.1.2 Generation of the ROC curves
All the information regarding the package pROC is available in the CRAN repository. [72] The code is
commented to make the functions, loops and conditions easier to understand.
ROC = function(data){
  library(pROC) # package with the function "roc"
  par(pty = "s") # squares the graphic generated
  i <- 2 # column number, skip ID
  z <- colnames(data[, 2:(ncol(data)-2)]) # vector with the attribute names
  auc <- vector()
  roc.df <- c(1:31)
  while (i <= (ncol(data)-2)) { # 2-19 attributes
    # choose positive class based on class proportion
    # levels = c("negative class", "positive class")
    # response is 1 and non-response is 0
    # calculation of the ROC curve parameters for the attribute in the position "i"
    # from the "data"
    if (mean(as.vector(unlist(data[, ncol(data)]))) > 0.5) { # scenario 1
      roc.info <- roc(data[, ncol(data)], data[, i], legacy.axes=TRUE,
                      levels=c(1, 0), plot=TRUE, xlab="FPR", ylab="TPR", col="#377eb8",
                      lwd=2, print.auc=TRUE, print.auc.x=0.45, print.auc.y=0.12)
    } else if (mean(as.vector(unlist(data[, ncol(data)]))) <= 0.5) { # scenario 2
      roc.info <- roc(data[, ncol(data)], data[, i], legacy.axes=TRUE,
                      levels=c(0, 1), plot=TRUE, xlab="FPR", ylab="TPR", col="#377eb8",
                      lwd=2, print.auc=TRUE, print.auc.x=0.45, print.auc.y=0.12)
    }
    j <- i-1
    auc[j] <- auc(roc.info) # retrieves the AUC for the attribute in the position "i"
    # from the "data"
    print(ci.auc(roc.info)) # prints the DeLong confidence interval for the AUC value
    # creation of a data frame with the TPR, FPR and threshold for the attribute in
    # the position "i" from the "data"
    i.df <- data.frame(variable = rep(colnames(data[i])),
                       TPR = roc.info$sensitivities, FPR = (1 - roc.info$specificities),
                       thresholds = roc.info$thresholds, direction = roc.info$direction,
                       stringsAsFactors = FALSE)
    # different attributes have different thresholds; in order to create the data
    # frame with all the attributes, all the individual data frames need to have the
    # same length
    while (nrow(i.df) < 31) { # filling the remaining rows
      r <- rep("NA")
      i.df <- rbind(i.df, r)
    }
    # junction of the data frame for the attribute in the position "i" from the
    # "data" to the major data frame
    roc.df <- cbind(roc.df, rep("x"), i.df)
    i <- i+1
  }
  zauc <- cbind(z, auc) # matrix with the attribute names and respective AUC values
}
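As a cross-check on the values returned by auc(), the AUC can also be obtained in base R from the Mann-Whitney rank statistic. The following is only a minimal sketch and is not part of the original analysis: auc_rank, scores and labels are hypothetical names, the data are invented, and it assumes that 1 marks the positive class and that higher scores point towards that class.

```r
# Rank-based AUC: the fraction of positive/negative pairs in which the
# positive observation has the higher score (ties count as one half).
auc_rank <- function(scores, labels) {
  pos <- labels == 1                 # assumption: 1 is the positive class
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)                  # average ranks handle ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy data: three positives followed by three negatives
scores <- c(3, 4, 5, 1, 2, 3.5)
labels <- c(1, 1, 1, 0, 0, 0)
auc_value <- auc_rank(scores, labels)
print(auc_value)  # 8 of the 9 positive/negative pairs are correctly ordered
```

Unlike roc() with its default settings, this sketch does not choose the comparison direction automatically, so an attribute whose values are lower in the positive class would yield an AUC below 0.5.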
D Results
D.1 ROC Curves Analysis
The ROC curve of every attribute with an AUC above 0.70 was subsequently evaluated.
Some attributes were statistically significant in the Mann-Whitney U test but did not survive
the multiple comparison corrections, indicating that the initial result could have been obtained
by chance alone. These attributes were therefore excluded from further analysis as key attributes.
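The screening logic described above can be sketched in base R as follows. This is an illustrative example on synthetic data, not the code used in the thesis: the attribute count, the group sizes, the shift applied to the first attribute and the 0.05 cut-off are all assumptions, and only the BH correction is shown.

```r
set.seed(1)

# Hypothetical data: 5 attributes in 22 responders and 6 non-responders
n_attr <- 5
responders <- matrix(rnorm(22 * n_attr), ncol = n_attr)
non_responders <- matrix(rnorm(6 * n_attr), ncol = n_attr)
non_responders[, 1] <- non_responders[, 1] + 3  # only attribute 1 truly differs

# Mann-Whitney U test (wilcox.test) for each attribute
p_raw <- sapply(seq_len(n_attr), function(k) {
  wilcox.test(responders[, k], non_responders[, k])$p.value
})

# Benjamini-Hochberg correction across all attributes tested
p_bh <- p.adjust(p_raw, method = "BH")

# An attribute is kept only if it remains significant after correction
screen <- data.frame(attribute = paste0("attr", seq_len(n_attr)),
                     p_raw = round(p_raw, 6),
                     p_BH = round(p_bh, 6),
                     keep = p_bh < 0.05)  # assumed significance level
print(screen)
```

Because an adjusted p-value is never smaller than the raw one, an attribute can only move from significant to non-significant after the correction, never the other way around.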
Table D.1 lists the attributes found, together with the pair of TPR and FPR considered best for
each of them in the different scenarios.
Since each scenario has a different class proportion, comparing the performance metrics
between scenarios may not be straightforward. In addition, because the positive class changes with
the scenario, the initial purpose of maximizing the TPR or minimizing the FPR also depends on the
scenario considered, as explained previously in Subsection 4.1.1. These issues should be taken
into consideration when choosing an attribute to pursue in a more extensive study.
Table D.1: Summary of the attributes found statistically interesting when considering the AUC analysis with their respective true positive rate (TPR), false positive rate (FPR), threshold at which these rates are verified, accuracy (acc), precision (pre), F1-measure (F1), AUC (with the respective 95% confidence interval) and p-values before and after the multiple comparison tests.
Scenario  Data set       Attribute      AUC    TPR    FPR    Threshold  Acc    Pre    F1
1         ADC PreTreat   Kurtosis       0.855  0.667  0.087  3.30       0.862  0.667  0.667
1         ADC PostTreat  rMAD           0.810  0.800  0.143  111        0.846  0.571  0.667
1         FF PreTreat    90 Percentile  0.879  0.667  0.091  19.0       0.857  0.667  0.667
1         FF PreTreat    Median         0.856  0.833  0.182  11.0       0.821  0.556  0.667
1         FF PreTreat    RMS            0.856  0.833  0.182  17.4       0.821  0.556  0.667
1         FF PreTreat    Skewness       0.856  0.883  0.182  0.624      0.821  0.556  0.667
1         FF PreTreat    Total Energy   0.864  0.883  0.182  2.72×10^7  0.821  0.556  0.667
1         FF PostTreat   Skewness       0.853  0.600  0      0.70       0.917  1.000  0.750
2         ADC Delta      MAD            0.800  0.800  0.125  25.2       0.846  0.800  0.800
2         FF Delta       10 Percentile  0.804  0.900  0.385  -0.50      0.739  0.643  0.750
2         FF Delta       Mean           0.785  0.900  0.385  -0.60      0.739  0.643  0.750
3         ADC Delta      MAD            0.782  0.800  0.091  25.2       0.857  0.889  0.842
3         FF Delta       10 Percentile  0.919  0.875  0.100  -0.5       0.889  0.875  0.875
3         FF Delta       90 Percentile  0.906  1.000  0.200  0.5        0.889  0.800  0.889
3         FF Delta       Mean           0.900  1.000  0.200  -0.122     0.889  0.800  0.889
3         FF Delta       RMS            0.888  1.000  0.200  -0.100     0.889  0.800  0.889
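The relationship between the metrics reported in Table D.1 can be illustrated with a small base R helper that derives them from the four cells of a confusion matrix. This is a sketch only: classification_metrics is a hypothetical function, and the counts passed to it are invented so that the resulting rates resemble the first row of the table; they are not taken from the study data.

```r
# Derive TPR, FPR, accuracy, precision and F1 from confusion-matrix counts.
# tp/fn refer to the positive class, fp/tn to the negative class.
classification_metrics <- function(tp, fn, fp, tn) {
  tpr <- tp / (tp + fn)                    # true positive rate (recall)
  fpr <- fp / (fp + tn)                    # false positive rate
  acc <- (tp + tn) / (tp + fn + fp + tn)   # accuracy
  pre <- tp / (tp + fp)                    # precision
  f1  <- 2 * pre * tpr / (pre + tpr)       # harmonic mean of precision and TPR
  c(TPR = tpr, FPR = fpr, Acc = acc, Pre = pre, F1 = f1)
}

# Hypothetical counts: 6 positive and 23 negative cases
m <- classification_metrics(tp = 4, fn = 2, fp = 2, tn = 21)
print(round(m, 3))
```

With a minority positive class, as in these invented counts, accuracy is driven mostly by the negative cases, which is one reason why metrics from scenarios with different class proportions are hard to compare directly.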
D.2 ROC Curves
Figure D.1: ROC curve for the attribute Kurtosis in the ADC pre-treatment data set in scenario 1, with a corresponding AUC value of 0.855 (0.679-1.000).
Figure D.2: ROC curve for the attribute 90 Percentile in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.879 (0.747-1.000).
Figure D.3: ROC curve for the attribute Median in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.698-1.000).
Figure D.4: ROC curve for the attribute Root Mean Squares in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.704-1.000).
Figure D.5: ROC curve for the attribute Skewness in the FF pre-treatment data set in scenario 1, with a corresponding AUC value of 0.856 (0.702-1.000).
Figure D.6: ROC curve for the attribute Total Energy in the FF post-treatment data set in scenario 1, with a corresponding AUC value of 0.864 (0.703-1.000).
D.3 Box and Whiskers Plots
Table D.2: Statistical parameters concerning the design of the Box and Whiskers plots for the key attributes and clinical variables. The classes are identified as responders and non-responders, corresponding to the class of patients that are considered to respond and not respond to treatment, respectively.
Statistical Parameters    Kurtosis (ADC PreTreat 1)       90 Percentile (FF PreTreat 1)
                          Responders   Non-Responders     Responders   Non-Responders
lower extreme             2.12         2.82               21           5
Q1                        2.57         2.96               36           10
median                    2.76         3.48               45           16
Q3                        3.06         3.68               46           32
upper extreme             3.56         4.56               47           40
number of observations    22           6                  22           6
outliers                  -            -                  10, 13, 20   -
Statistical Parameters    Median (FF PreTreat 1)          RMS (FF PreTreat 1)
                          Responders   Non-Responders     Responders   Non-Responders
lower extreme             0            0                  6            4.1
Q1                        18           0                  22           6.9
median                    34           2.50               34           9.5
Q3                        38           5.00               36           15.0
upper extreme             41           5.00               40           15.0
number of observations    22           6                  22           6
outliers                  -            31                 -            31
Statistical Parameters    Skewness (FF PreTreat 1)        Total Energy (FF PreTreat 1)
                          Responders   Non-Responders     Responders   Non-Responders
lower extreme             -1.22        -0.35              5330628      1488935
Q1                        -1.02        0.85               53686528     5280098
median                    -0.58        1.55               166590529    10681660
Q3                        0.09         2.51               197928989    18283058
upper extreme             1.69         3.47               249195096    18283058
number of observations    22           6                  22           6
outliers                  2.20, 2.24   -                  -            151806015
Statistical Parameters    LHD (Clinical 1)                Serum IgA (Clinical 1)
                          Responders   Non-Responders     Responders   Non-Responders
lower extreme             100          144                23           25
Q1                        137          197                26           26.5
median                    148          227                29           32.5
Q3                        167.5        272                86           41
upper extreme             194          277                171          45
number of observations    23           6                  21           4
outliers: 178, 541, 3180