
Kernel-based image analysis towards MRI segmentation and classification

David Augusto Cardenas Pena

Universidad Nacional de Colombia

Faculty of Engineering and Architecture

Department of Electrical, Electronic, and Computing Engineering

Manizales, Colombia

2016

Kernel-based image analysis towards MRI segmentation and classification


A dissertation submitted in partial fulfillment of the requirements for the degree of:

Doctor in Engineering - Automatics

Advisor:

Cesar German Castellanos Domínguez, PhD

Examining Committee:

Beatriz Marcotegui Iturmendi, PhD

MINES ParisTech

Juan Manuel Gorriz Saez, PhD

Universidad de Granada

Pablo Andres Arbelaez Escalante, PhD

Universidad de los Andes

Academic Research Group:

Signal Processing and Recognition Group


Faculty of Engineering and Architecture

Department of Electrical, Electronic, and Computing Engineering

Manizales, Colombia

2016

Image analysis using kernels for MRI segmentation and classification

Thesis submitted in partial fulfillment of the requirements for the degree of:

Doctor in Engineering - Automatic Engineering

Advisor:

Cesar German Castellanos Domínguez, PhD

Examining Committee:

Beatriz Marcotegui Iturmendi, PhD

MINES ParisTech

Juan Manuel Gorriz Saez, PhD

Universidad de Granada

Pablo Andres Arbelaez Escalante, PhD

Universidad de los Andes

Research Group:

Grupo de Control y Procesamiento Digital de Señales

Faculty of Engineering and Architecture

Department of Electrical, Electronic, and Computing Engineering

Manizales, Colombia

2016

To my little sister, have a good journey.

To Martín, welcome.

Acknowledgment

To my family, thanks for always being there, for all of your love, and for encouraging me.

I would like to express my gratitude to my advisor, Prof. Dr. German Castellanos-Domínguez, for his valuable guidance during this research and for teaching me to think outside the box. Thanks to the Signal Processing and Recognition Group (SPRG) at the Universidad Nacional de Colombia (Manizales) for having me there over these years. Thanks to my lab colleagues for discussing ideas and giving me new points of view. My friends el Oso, el Paya, and el Mostro always helped me in good and hard times. Finally, thanks to Mauricio, my master’s student, for letting me advise him and for working with me.

Furthermore, thanks to all the members of the Fundacion Centro de Investigacion Enfermedades Neurologicas (Madrid, Spain), the Medical Imaging and Signal Processing group (Ghent, Belgium), and the Centre for Mathematical Morphology (Fontainebleau, France) for their hospitality. In particular, I would like to thank professors Juan Antonio Hernandez Tamames, Stefaan Vandenberghe, and Etienne Decenciere for the opportunity to visit their labs, as well as for their helpful insights about open issues and research directions.

Finally, I recognize that this research would not have been possible without the COLCIENCIAS Ph.D. scholarship Programa Nacional de Formacion de Investigadores “Generacion del Bicentenario” 2011. In addition, some of the results of this work were partially supported by the research projects 111045426008, 20101008258, and 111056934461, also funded by COLCIENCIAS.


Abstract

Medical image analysis has recently attracted significant interest because of its wide range of applications, including brain surgery, atlas building, and computer-aided diagnosis. In addition, kernel methods are among the most widely used machine learning approaches across several tasks, owing to their properties and the variety of available techniques. In this work, we combine medical images, specifically magnetic resonance images, with kernel theory to improve segmentation and classification tasks.

The first contribution of the thesis is a novel tuning criterion for the Gaussian kernel bandwidth parameter, termed KEIPV. The approach maximizes the information potential variability in the reproducing kernel Hilbert space. This criterion is used to tune all Gaussian kernels considered in this work. Secondly, we propose a new image representation that highlights inherent image categories, particularly age and gender. The resulting embedded image similarity supports atlas-based segmentation algorithms by selecting the most relevant templates, so that both the computational cost (induced by pairwise image registration) and the segmentation performance are improved. Then, we propose two template-based segmentation approaches. The first introduces an information-based cost function into the Bayesian image intensity modeling, so that the model parameters fit the image properties better. The second is patch-based: we introduce a voxel-wise feature extraction that is learned locally using supervised information provided by the labels of neighboring voxels. To this end, maximizing the centered kernel alignment (CKA) criterion enhances tissue discrimination in the feature space. Finally, the last chapter describes a new training scheme for multi-layer perceptrons (MLP), with two contributions: a supervised MLP pre-training stage that uses CKA to learn linear projection matrices, and a matrix conditional entropy cost function for training the MLP parameters in a backpropagation updating scheme.

Keywords: Medical image analysis, Kernel theory, MRI clustering, Atlas-based segmentation, Template selection, Bayesian segmentation, Patch-based segmentation, Computer-aided diagnosis, Information-based cost function, Neural networks


Summary

Recently, medical image analysis has attracted great interest owing to its wide range of applications, including brain surgery, atlas construction, and computer-aided diagnosis. Meanwhile, kernel theory is one of the most widely used machine learning approaches across varied tasks because of its properties and its many techniques. This work combines medical images, in particular magnetic resonance images, with kernel theory to improve segmentation and classification tasks.

The first contribution of this thesis is a new criterion, termed KEIPV, for tuning the bandwidth of the Gaussian kernel as its only free parameter. The algorithm maximizes the variability of the information potential in the reproducing kernel Hilbert space. This criterion is used to tune all the Gaussian kernels considered in this work. Next, a new 3D image representation is proposed that highlights the categories inherent to the subjects, specifically age and gender. The embedded image similarity measure supports atlas-based segmentation algorithms by selecting the most relevant templates, so that the computational cost (induced by deformable registration) is reduced and the segmentation performance is improved. Then, two atlas-based segmentation strategies are proposed. The first introduces a cost function based on information measures into the Bayesian segmentation scheme, so that the model parameters fit the image properties better. The second is a patch-based strategy, for which a local voxel-wise feature extraction is proposed that is trained with supervised information from the labels of neighboring voxels. To this end, maximizing the centered kernel alignment (CKA) criterion improves tissue discrimination in the feature space. Finally, a new training scheme for multi-layer perceptrons (MLP) is described in the last chapter, with two contributions: a supervised pre-training stage using CKA that estimates linear projection matrices, and a cost function based on the matrix conditional entropy for fine-tuning the MLP parameters in a backpropagation updating scheme.

Keywords: Medical image analysis, Kernel theory, MRI clustering, Atlas-based segmentation, Template selection, Bayesian segmentation, Patch-based segmentation, Computer-aided diagnosis, Cost function based on information measures, Neural networks

Contents

Acknowledgement
Abstract
1 Preliminaries
    1.1 Motivation
    1.2 Problem statement
    1.3 State-of-the-art
        1.3.1 Image Representation Enhancement
        1.3.2 Image Classification
    1.4 Objectives
        1.4.1 General Objective
        1.4.2 Specific objectives
    1.5 Contributions
2 Background
    2.1 Reproducing kernel Hilbert spaces in machine learning
    2.2 Gaussian kernel estimation from information potential variability
    2.3 Template-based Image Segmentation
    2.4 Magnetic resonance image databases
3 Kernel-based Template Selection from the using Embedding Representations
    3.1 Template selection for image segmentation
        3.1.1 Feature extraction based on inter-slice similarities
    3.2 Experiments and Results
        3.2.1 MRI Preprocessing
        3.2.2 ISK feature extraction
        3.2.3 Image similarity function from TKR
        3.2.4 Tissue Labeling Performance
    3.3 Summary
4 Information-based cost function for Bayesian MRI segmentation
    4.1 Bayesian Image Segmentation
        4.1.1 Parameter optimization
    4.2 Experiments and Results
        4.2.1 Evaluation of performed segmentation
    4.3 Summary
5 Multi-atlas label fusion using supervised local weighting
    5.1 Feature-based label fusion within α-neighborhoods
        5.1.1 Supervised feature learning based on centered kernel alignment
        5.1.2 CKA-LF optimization using gradient descent
    5.2 Experimental Setup
        5.2.1 Algorithm parameter setup
        5.2.2 Patch-based segmentation performance
    5.3 Summary
6 Magnetic resonance image classification using kernel-enhanced neural networks
    6.1 Multi-layer perceptron-based classifier using kernels
        6.1.1 Matrix-based entropy as a cost function for MLP
        6.1.2 Network pre-training using centered kernel alignment
    6.2 Experimental Setup
        6.2.1 Processing of MRI data
        6.2.2 Tuning of ANN model parameter
        6.2.3 Classifier performance of neurological classes
    6.3 Summary
7 Conclusions and Future Work
    7.1 Concluding remarks
    7.2 Future Work
    7.3 Academic discussion
Bibliography
Biographical sketch

List of Figures

1-1 4D neonatal brain atlas.
1-2 Components of brain mapping.
1-3 Magnetic resonance intensity histogram of brain structures.
2-1 Kernel-based mapping.
2-2 KEIPV illustrative example.
2-3 Multi-atlas-based segmentation scheme.
2-4 Examples of MRI databases.
3-1 ISK representation for IXI database.
3-2 Similarity kernels for IXI database.
3-3 KPCA-based projection of IXI database from SSD and TKR.
3-4 Template selection performance.
4-1 α-order Renyi’s entropy versus the number of iterations of the optimization procedure, for several α values and a given image in the dataset.
4-2 Average Dice similarity index versus the entropy order for the available image noise intensities.
5-1 Proposed multi-atlas patch-based label fusion scheme.
5-2 Parameter tuning for patch-based approaches by exhaustive search.
5-3 Resulting kernel matrices for a random subset of voxels before and after learning the projection matrix Wr. Voxels are sorted by tissue type.
5-4 β radius effect in a subject’s volume.
6-1 General MRI classification pipeline.
6-2 ANN performance versus the number of hidden nodes.
6-3 Relevance indexes grouped by feature type.
6-4 Receiver-operating-characteristic curve (top) and confusion matrix (bottom) on the 30% test data for AEN (left), PCA (center), and CKA (right) initialization approaches at the best parameter set of the ANN classifier.

List of Tables

2-1 Demographic and clinical details of the selected ADNI cohort.
2-2 Summary of characteristics of the considered MRI databases.
3-1 Template selection performance for each tissue using optimal number of atlases.
4-1 Dice index for each structure at optimal α = 0.5.
5-1 SATA segmentation performance.
6-1 FreeSurfer extracted features.
6-2 Best performing algorithms in the 2014 CADDementia challenge.
6-3 ADNI classification.

1 Preliminaries

This chapter introduces the research problem and the upcoming chapters. First, a brief medical motivation for the research problem is presented, together with the issues under consideration. A review of the state of the art, focused specifically on magnetic resonance brain image analysis, then follows, and some of the main limitations are highlighted. Finally, the proposed hypothesis and the thesis objectives are listed.

1.1 Motivation

Medical images have evolved over the years, providing essential information for clinical applications such as diagnosis, therapy planning and execution, and disease and treatment monitoring. The best-known imaging modalities include magnetic resonance imaging (MRI), ultrasound, X-ray computed tomography (CT), and positron emission tomography (PET). In particular, MRI is a commonly used medical imaging technique because it is non-invasive, avoids ionizing radiation, and offers good contrast resolution for different body tissues [Wang and Wang, 2008]. MRI can reveal precise anatomical details, and its flexibility allows tissue contrast to be enhanced by varying the image acquisition parameters: adjusting the radio-frequency (RF) and gradient pulses and the relaxation timings highlights different components of the imaged object and produces high-contrast images [Liew and Yan, 2006].

MRI has become essential for structural and functional studies of the human brain, one of the organs arousing the most interest in modern medicine, because of the detailed anatomical images it provides. Such images allow a precise analysis of cerebral structures, which is essential in interdisciplinary technologies such as computer-aided detection, diagnosis, and patient follow-up. In general, brain image analyses are classified into two categories [Cerrolaza et al., 2012]: i) structural analysis and ii) functional analysis.

Regarding structural analysis, brain morphometry is perhaps the most important application: it quantifies morphological brain features to learn how age, gender, disease, genetic composition, environmental exposures, and treatment, among other factors, affect brain structure [Davatzikos, 2004, Ashburner and Friston, 2000]. To this end, image analysis methods, such as segmentation and registration, and machine learning techniques are applied to structural volumes (three-dimensional images) and sometimes to functional volumes (time-varying images). As an example application, the authors in [Kuklisova-Murgasova et al., 2011] built a 4D temporal probabilistic atlas of the neonatal brain using regression approaches to dynamically generate prior tissue probability maps at any chosen stage of neonatal brain development. Figure 1-1 shows the resulting temporal atlas. This kind of brain model has improved the survival rates of prematurely born infants, in particular of those developing neurological problems.

Figure 1-1: A 4D dynamic probabilistic atlas of neonatal brain structures at 29, 32, 35, 38, 41, and 44 weeks of gestational age, shown in columns from left to right (axial view). Structure probability maps are shown in rows from top to bottom: intensity template, color-coded sum of probability maps, white matter, cortical gray matter, cerebrospinal fluid, and subcortical gray matter. The cerebellum and brainstem are not shown as they are not present in this slice. Figure and caption taken from the original paper [Kuklisova-Murgasova et al., 2011].

Particularly for neuropathologies, several medical studies have shown that neurodegeneration begins in the medial temporal lobe, successively affecting the entorhinal cortex, the hippocampus, and the limbic system, and then extends towards neocortical areas [Wolz et al., 2011, Magnin et al., 2009]. For example, early and significant hippocampal atrophy in people with memory complaints usually points to a diagnosis of Alzheimer’s disease (AD) [Lotjonen et al., 2010]. Therefore, along with clinical history, neuropsychological tests, and laboratory assessment, the joint clinical diagnosis of AD also includes imaging techniques such as PET and MRI. Nonetheless, issues related to image quality and radiologist experience demand the automatic assessment of quantitative biomarkers to enhance the performance of dementia diagnosis [Dubois et al., 2014, Jack et al., 2013, McKhann et al., 2011]. As a result, estimating morphological changes from structural data may support the computer-aided diagnosis of neurological diseases. In these cases, pattern recognition and multivariate data analysis methods have been used to build discriminative models [Bron et al., 2015, Ramírez et al., 2013]. These methods benefit from the large amounts of neuroimaging data that have become available in recent years to learn differences that clinicians may miss during qualitative visual inspection. Consequently, earlier and more objective diagnoses can be achieved than with clinical criteria alone [Kloppel et al., 2012a].

In the regional context, researchers at the Clínica de la Memoria (Universidad Autonoma de Manizales), in partnership with the Signal Processing and Recognition research group (Universidad Nacional de Colombia), have been studying biomarkers for attention deficit hyperactivity disorder (ADHD) in children. As a result of previous findings, they are developing the research project Assisted evaluation of evoked related potentials as a biomarker of the ADHD, whose main goal is to improve ADHD diagnosis using a multimodal scheme that combines clinical tests, brain activity recordings, and structural MRI.

On the other hand, the most widespread functional analysis task is brain mapping, which assesses neuronal activity from the synchronous, compact electric sources generated within the brain. In the standard setup, the propagation of such activity is measured with excellent temporal resolution by a set of sensors located on or near the subject’s scalp [de Munck et al., 1988]. Electroencephalography (EEG) sensors measure voltage potentials using standardized electrode placement systems (see Figure 1-2(a)). In contrast, magnetoencephalography (MEG) records the magnetic fields produced by electrical currents in the brain using very sensitive magnetometers. Inverse modeling then estimates the unknown sources over the brain cortex that best fit the recorded M/EEG data, as illustrated in Figure 1-2(b).

Nonetheless, M/EEG recordings depend not only on the location of the activated neurons (source model) but also on the spatial arrangement of the electrodes/sensors (channel positions) and on the geometrical and electromagnetic properties of the head. A leadfield matrix gathers all these properties and describes the current flow from each source position, through the head geometry, to a given sensor, so that sources and sensors are linearly related [Montes-Restrepo et al., 2014]. In this case, the head geometry is computed from the patient’s medical images (MRI/CT) by segmenting them into components with different conductivity properties, e.g., skin, bone, cerebrospinal fluid, and brain tissue [Darvas et al., 2004]. Hence, accurate segmentation of head structures from imaging studies allows patient-specific conductivity matrices to be constructed and improves the accuracy of source activity modeling [Lanfer et al., 2012, Grech et al., 2008].
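The linear relation between sources and sensors can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: the leadfield here is a random stand-in (a real one is derived from the segmented head model), the sizes and noise level are hypothetical, and the Tikhonov-regularized minimum-norm estimate is one common inverse solution chosen for brevity, not necessarily the one used in the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
n_sensors, n_sources = 32, 200

# Stand-in leadfield: in practice it is computed from the head model.
L = rng.standard_normal((n_sensors, n_sources))

# Ground truth with a few active sources.
s_true = np.zeros(n_sources)
s_true[[10, 50, 120]] = [1.0, -0.5, 0.8]

# Forward model: sensor readings are a linear function of the sources plus noise.
y = L @ s_true + 0.01 * rng.standard_normal(n_sensors)


def minimum_norm_estimate(leadfield, measurements, lam):
    """Tikhonov-regularized minimum-norm inverse: L^T (L L^T + lam I)^-1 y."""
    gram = leadfield @ leadfield.T + lam * np.eye(leadfield.shape[0])
    return leadfield.T @ np.linalg.solve(gram, measurements)


s_hat = minimum_norm_estimate(L, y, lam=1e-2)

# Relative misfit between the measurements and those predicted by the estimate.
residual = np.linalg.norm(L @ s_hat - y) / np.linalg.norm(y)
```

Because there are far fewer sensors than sources, many source configurations explain the same measurements; the regularization picks the one with the smallest norm. This is also why errors in the leadfield, and hence in the underlying segmentation, propagate directly into the estimated sources.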

Figure 1-2: Left (a): Electrode locations of the International 10-20 system for electroencephalography recording (Fp1, Fp2, F7, F3, Fz, F4, F8, A1, T3, C3, Cz, C4, T4, A2, T5, P3, Pz, P4, T6, O1, O2, with the nasion and inion as landmarks). Right (b): Major steps in the source location for brain mapping (panel labels: MRI, CT, channel positions, MEG, EEG, head model, leadfield, processing of functional data, inverse solution, statistics/visualization).

In this regard, the research groups Signal Processing and Recognition (Universidad Nacional de Colombia) and Control e Instrumentacion (Universidad Tecnologica de Pereira), together with the Instituto de Epilepsia y Parkinson - Neurocentro, proposed the research project entitled Development of an automatic system for brain mapping and intraoperative monitoring: Application to neurosurgery. The research is conducted on subjects diagnosed with epilepsy and Parkinson’s disease and attempts to integrate brain mapping and surgery monitoring into a single framework that improves clinical procedures, enhances the treatment of neurological diseases, and prevents surgical complications.

1.2 Problem statement

As stated above, structural and functional brain imaging play an expanding role in neuroscience and experimental medicine. Nevertheless, the amount of data produced increasingly exceeds the visual analysis capacity of expert clinicians. For instance, building reliable atlases and leadfields requires several structures to be manually delineated in hundreds of volumes, a task that depends on factors related to image quality and radiologist experience [Dubois et al., 2014]. As a consequence, there is a growing need for automated image analysis: information extraction requires accurate segmentation, while robust classification schemes may support the computer-aided diagnosis of neuropathologies [Sabuncu and Konukoglu, 2015, Heckemann et al., 2006]. Nevertheless, such procedures are not straightforward to perform because of issues related to imaging artifacts or structure properties.

Perhaps the main artifact is a slowly varying spatial bias that multiplies the measured intensities in several imaging modalities, known as intensity non-uniformity, intensity inhomogeneity, shading, or bias field. Some causes of intensity inhomogeneity in MRI are radio-frequency pulse attenuation, non-uniformity of the magnetic field, and the magnetic susceptibility of the tissues [Vovk et al., 2007, Brinkmann et al., 1998]. Although such shading is hardly visible and not a serious drawback for qualitative clinical diagnosis, the intensity variation significantly hampers precise measurement in automated processing tasks such as segmentation, registration, and classification [Demirhan and Guler, 2011, Balafar et al., 2010].
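The multiplicative nature of the bias field can be made concrete with a short simulation. This is a minimal sketch under assumptions that are not taken from the cited works: the polynomial shape of the field, its strength, and the two-class synthetic image are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_bias_field(shape, strength=0.3):
    """Smooth multiplicative field from a low-order polynomial (illustrative choice)."""
    rows = np.linspace(-1.0, 1.0, shape[0])[:, None]
    cols = np.linspace(-1.0, 1.0, shape[1])[None, :]
    return 1.0 + strength * (0.5 * rows + 0.5 * cols**2 - 0.25 * rows * cols)


# Piecewise-constant image with two hypothetical tissue intensities.
true_image = np.full((64, 64), 100.0)
true_image[:, 32:] = 150.0

# The observed image is the true image multiplied by the slowly varying bias.
bias = simulate_bias_field(true_image.shape)
observed = true_image * bias

# In the log domain the multiplicative bias becomes an additive term.
log_residual = np.log(observed) - np.log(true_image)

# A single tissue now spans a range of intensities instead of one value.
left_spread = observed[:, :32].max() - observed[:, :32].min()
```

The spread within one tissue shows why intensity-only thresholds or clustering can fail: voxels of the same class no longer share an intensity, even though the shading is barely noticeable to a human reader. The log transform is a common first step in correction methods because it turns the product into a sum.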

In particular, segmentation is difficult to perform based solely on the MR intensity because anatomically distinct structures do not necessarily differ in their signal properties. For instance, MR relaxometry has shown that different regions of white matter have significantly different T1 relaxation times [Cho et al., 1997]. The frontal cortex also has an average T1 that is 20% longer than that of the motor and somatosensory cortex [Steen et al., 2000]. Moreover, the MR intensity histograms of manually labeled brain structures reported in [Ledig et al., 2012] show that the boundaries between white matter and subcortical gray matter are generally less clear than those between white matter and cortical gray matter. The original histograms are shown in Figure 1-3. Besides, limited image resolution yields voxels composed of more than one tissue type, known as the partial volume effect [Ahmed et al., 2011, Heckemann et al., 2006, Liew and Yan, 2006].

Figure 1-3: Gaussian intensity distribution of manually segmented classes. CSF (black), WM (red), deep GM structures (green), cortical GM, brainstem and cerebellum GM fractions (blue).

Regarding computer-aided diagnosis, using the millions of voxels of an MRI directly as the input to conventional classification machines is still unrealistic, because the number of features far exceeds the number of subjects and the resulting number of parameters leads to model overfitting [Ota et al., 2015]. Dimensionality reduction through feature extraction and feature selection generally improves the generalization of pattern recognition systems. In this regard, brain parcellation from predefined anatomical templates provides a simple yet versatile and interpretable set of features such as volume, thickness, and shape [Cuingnet et al., 2011, Zhang et al., 2011]. However, the diagnosis outcome still depends on the quality of the resulting parcellation and on the template definition, which sometimes differs from atlas to atlas even for the same structure [Ota et al., 2014, Bohland et al., 2009]. In addition, the very nature of the disease poses a challenge. For instance, dementia diagnosis from imaging studies also discriminates mild cognitive impairment (MCI), a heterogeneous category intermediate between the healthy and Alzheimer’s diagnostic groups, from which subjects may convert to AD or return to normal cognition [Kloppel et al., 2015].

1.3 State-of-the-art

In the medical image processing field, a set of approaches deals with the bias field at the image segmentation stage. In [Wang and Wang, 2008], a fuzzy C-means (FCM) algorithm for brain MRI segmentation is presented that incorporates both local and non-local neighborhood information into the clustering process to increase noise robustness. Another FCM approach is introduced by [Sikka et al., 2009], where an entropy-driven homomorphic filter is used for inhomogeneity correction. A histogram-based local peak merger using an adaptive window initializes the cluster centers, and a neighborhood-based membership ambiguity correction smooths the boundaries between different tissue clusters. [Li et al., 2008] segment brain MRI and correct the bias field using a spatially constrained kernel clustering algorithm. Such a kernel implicitly maps the image data to a higher-dimensional space, enhancing separability. The clustering and the bias field correction are alternated, so that they benefit each other and accelerate overall convergence.
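For reference, the baseline that these variants extend can be sketched as follows. This is plain fuzzy C-means on voxel intensities only; it does not include the neighborhood terms, bias correction, or kernel mapping of the cited methods. The function name, the deterministic initialization, and the synthetic three-class intensities are assumptions made for illustration.

```python
import numpy as np


def fuzzy_c_means(x, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Standard FCM on a 1-D array of intensities; returns (centers, memberships)."""
    x = np.asarray(x, dtype=float).ravel()
    # Simple deterministic start: spread the centers over the intensity range.
    centers = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n_samples, n_clusters).
        dist = np.abs(x[:, None] - centers[None, :]) + 1e-12
        # Membership update: proportional to d^(-2/(m-1)), normalized per sample.
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Center update: mean of the samples weighted by u^m.
        w = u ** m
        new_centers = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        converged = np.max(np.abs(new_centers - centers)) < tol
        centers = new_centers
        if converged:
            break
    # Memberships are those computed in the last iteration.
    return centers, u


# Synthetic intensities for three hypothetical tissue classes.
rng = np.random.default_rng(0)
intensities = np.concatenate(
    [rng.normal(mu, 5.0, 300) for mu in (50.0, 100.0, 150.0)]
)
centers, u = fuzzy_c_means(intensities, n_clusters=3)
labels = u.argmax(axis=1)  # hard labels from the soft memberships
```

Since each voxel is clustered independently of its neighbors, this baseline is sensitive to noise and to intensity inhomogeneity, which is the limitation that the spatially informed variants above set out to address.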

Active contour models (ACM) have also been used to deal with low-frequency artifacts at the segmentation stage. Specifically, geometric active contours are implicit level set functions defined on a higher dimension, which evolve according to a partial differential equation (PDE). Usually, the evolution equation results from minimizing an energy formulation by variational calculus. For instance, a summation of edge and region energies can be minimized in the variational scheme [Paragios and Deriche, 2002, Sum and Cheung, 2008]. [Brox and Cremers, 2008] approximated the piecewise-smooth Mumford-Shah functional of the variational framework using local means. Other localized statistics have been computed from convolutions, achieving results similar to piecewise-smooth segmentation in a much more efficient manner [Lankton et al., 2007, Lankton and Tannenbaum, 2008]. In this way, the variational scheme can model objects with heterogeneous statistics. The same goal is reached by [Li et al., 2007], where a local binary fitting (LBF) energy allows local features that are robust to intensity inhomogeneity to be extracted. Nevertheless, the number of operations required to compute the energies implies a higher computational cost. Furthermore, algorithms that solve the PDEs accurately need a large number of iterations to converge [Wang et al., 2010].

1.3.1 Image Representation Enhancement

A practical way to enhance the image properties and ease the discrimination, especially when a single modality is provided, is to include a voxel-wise feature extraction stage. For instance, in [Iftekharuddin et al., 2009], two kinds of texture features are extracted for brain tumor segmentation. The first ones, fractal-based, are estimated using the authors’ Piecewise-Triangular-Prism-Surface-Area algorithm. The second group results from combining fractal and wavelet analyses. These features were extracted from T1, T2, and FLAIR MRI modalities and fused by means of self-organizing maps. [Tu and Bai, 2010] also adopted texture features and combined them with Haar features in a probabilistic boosting tree, which was further enhanced with an auto-context model that iteratively refines the labeling results. The efficiency of different feature types from multimodal MRI was later studied by [Ahmed et al., 2011]. Fractal dimension, multi-fractional Brownian motion (mBm), level-set-based shape, and normalized image intensity made up the set of evaluated features. The evaluation introduces a feature selection stage using either PCA, boosting, entropy metrics, or a proposed Kullback-Leibler-divergence-based ranking. Results showed that mBm performed best for the T1 and FLAIR modalities, while the normalized intensity was more appropriate for segmenting T2 images.

Wavelet-like transforms have been widely studied for image representation in computer vi-

sion applications and also in medical image processing. In [Demirhan and Guler, 2011],

wavelet decomposition along with its statistical information feeds a self-organizing map and

supervised learning vector quantization to segment gray and white matter from T1 MRI.

[Alzubi et al., 2011] introduces a multi-resolution analysis using wavelets, ridgelets, and

curvelets for Region-Of-Interest segmentation on medical images, particularly cancer tissue.

Such transformations exhibited suitable edge reconstruction since a directional component is

included in the traditional wavelet transform. [De et al., 2011] maximizes the image entropy using Particle Swarm Optimization enhanced with a wavelet-based mutation operation. Reported results showed successful lesion extraction for high-contrast and

large lesions. [Cerrolaza et al., 2012] decomposes a statistical active shape model using the

wavelet transform to model object relationships at different levels of detail. The approach

properly models the localities at the cost of an increase in the feature space dimension.

Most recently, non-local weighted label fusion approaches have promoted the development of several groups of voxel-wise features. In general, such approaches relax the one-to-one registration correspondence by allowing neighboring labeled elements to

contribute to labeling a target location. Such a contribution, usually given in terms of a sim-

ilarity function, is then assessed using voxel-wise features. In the most straightforward case,

a spatially varying decision fusion measures the intensity difference between voxel-pairs after

deformable mapping as an indicator of the local registration performance [Isgum et al., 2009].

Nonetheless, more robust versions consider patch intensities around a voxel to constitute the

feature space and to estimate nonlocal means [Coupe et al., 2011, Rousseau et al., 2011].

Further, the voting weights can be computed as the combination factors of linearly re-

gressing the target patch from the surrounding ones in the atlases under some constraints

(e.g., sparsity) [Zhang et al., 2012]. The idea was later enhanced by including a discriminative dictionary learning stage [Tong et al., 2013], generative modeling [Wu et al., 2014], and a local element-wise atlas selection [Tong et al., 2015]. Most recently, [Bai et al., 2015] calculated a set of patch-based weights from contextual and gradient features. The outputs of linear filters, such as first- and second-order differences, 3D hyperplane, Sobel, and Laplacian filters, have also been combined with Haar-like features in a selection stage based on forest learning by [Ma et al., 2014].

1.3.2 Image Classification

As stated above, classifying volumes from the raw intensities is complex in several ways. Hence, feature extraction is an essential stage. According to the literature review, two kinds of widely accepted, standard MRI features are considered for

diagnosis from structural brain scans. Firstly, the structure-wise morphometry holds thick-

ness, area, and volume measurements for anatomically defined regions of interest. Such

regions of interest correspond to either the white and gray matter structures or the cortical and subcortical structures. FreeSurfer is the best-known toolkit for extracting such features in a fully automatic way [Jung et al., 2015, Ota et al., 2014, Westman et al., 2013,

Cuingnet et al., 2011]. Secondly, the voxel-based morphometry provides statistics at voxel-

level. The posterior tissue probability maps of gray and white matter, extracted with the Sta-

tistical Parametric Mapping (SPM), are the most considered features [Ramırez et al., 2016,

Khedher et al., 2015, Moradi et al., 2015, Ota et al., 2015, Chyzhyk et al., 2014, Falahati et al., 2014,

Cuingnet et al., 2011].

Such features have been joined with different multivariate pattern recognition (MVPR) tools

for neuroimage data classification. Reported classifiers range from conventional approaches

(k-Nearest Neighbors [Papakostas et al., 2015], Linear Discriminant Analysis [Sørensen et al., 2013],

Support Vector Machines [Kloppel et al., 2012b], Random Forests [Moradi et al., 2015], Re-

gressions [Eskildsen et al., 2015]) to the combination of classifiers [Farhan et al., 2014].

[Sabuncu and Konukoglu, 2015] analyzed three representative MVPR tools for schizophre-

nia, dementia, and attention deficit and hyperactivity disorder. Authors conclude that

MVPR tools offer more accurate predictions than univariate markers while the choice of

the feature set and machine-learning algorithm has a significant impact on prediction performance. In the particular case of dementia, most of the above approaches were evaluated at the 2014 CADDementia challenge for reproducing the clinical diagnosis of 354 subjects in a multi-class classification problem with three diagnostic groups [Bron et al., 2015]: patients diagnosed with Alzheimer’s disease, subjects with mild cognitive impairment (MCI), and healthy controls (NC), given their T1-weighted MRI scans. Although the best-performing algorithm yielded an accuracy of 63.0% and an area under the receiver-operating-characteristic curve of 78.8%, the attained true positive rates were 96.9% and 28.7% for NC and MCI, respectively. Such results evidence the bias towards specific classes when there is a heterogeneous class such as MCI.
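To make the gap between overall accuracy and per-class true positive rate concrete, the following minimal sketch computes both from a confusion matrix. The counts are purely hypothetical, chosen only for illustration; they are not the CADDementia results, and the helper name `accuracy_and_tpr` is an assumption of this sketch.

```python
def accuracy_and_tpr(confusion, classes):
    """Return overall accuracy and per-class true positive rate.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(classes)))
    tpr = {c: confusion[i][i] / sum(confusion[i]) for i, c in enumerate(classes)}
    return correct / total, tpr


classes = ("NC", "MCI", "AD")
# Hypothetical counts (rows: true class, columns: predicted class).
confusion = [
    [95, 3, 2],    # NC is almost always recognized
    [55, 30, 15],  # MCI is mostly confused with NC
    [10, 20, 70],
]
acc, tpr = accuracy_and_tpr(confusion, classes)
```

In this toy case a moderate overall accuracy hides a very uneven behavior across classes, which is the kind of bias discussed above.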

Another kind of machine learning tool, Artificial Neural Networks (ANN), has proven to be well suited to several computer-aided diagnosis tasks, presenting the following advantages [Amato et al., 2013, Chyzhyk et al., 2014]: (i) ability to process a large amount of data, (ii) reduced likelihood of overlooking relevant information, and (iii) reduction of diagnosis time. Nonetheless, setting up the initial architecture (termed pre-training) is an essential procedure for ANN implementation, which is most naively carried out using randomly initialized parameters. However, this strategy performs poorly in practice [Vincent et al., 2010]. To improve each random initial guess, a local unsupervised criterion can be used to pre-train each layer stepwise, aiming to produce a useful higher-level description from the lower-level representation output by the previous layer. Particular examples that use unsupervised representation learning are the following: Restricted Boltzmann Machines [Hinton et al., 2006], autoencoders [Bengio et al., 2007], sparse autoencoders [Ranzato et al., 2007], and greedy layer-wise training, the most common approach, which learns one layer of a deep architecture at a time [Bengio, 2012]. Although unsupervised pre-training generates hidden representations that are more useful than the input space, many of the resulting features may be irrelevant for the discrimination task [Weston et al., 2012, Mohamed et al., 2011].

1.4 Objectives

1.4.1 General Objective

To develop a segmentation and classification framework using kernel tools supporting par-

tition, clustering, and classification of magnetic resonance images of the human brain. The

framework must include prior tissue distributions for enhancing the feature extraction of atlas-voting strategies. In addition, demographic data must be included in the learning stages, introducing a subject-dependent selection of templates and extracting discriminative

biomarkers for clinical prediction of diseases from brain structural information.

1.4.2 Specific objectives

• To develop an unsupervised kernel-based image representation for 3D volumes for clus-

tering anatomically similar subjects. The proposed representation must highlight the

inherent image distributions while reducing the feature space dimension. Additionally,

the image metric induced in the new space must support template selection for an

atlas-based image segmentation approach.

• To propose a learning methodology for voxel representation using kernel-devoted cost

functions enhancing intrinsic tissue features. The scheme must include local intensity

distribution and supervised tissue information. Resulting features must improve seg-

mentation performance of atlas voting strategies and provide robustness under image

artifacts.

• To build a supervised scheme for biomarker extraction from structural brain MRI

scans supporting clinical prediction tasks. Biomarkers extracted from prior subject information must enhance MRI discrimination and provide data interpretability.

1.5 Contributions

Taking into account the results of the proposed models, we highlight the following contribu-

tions of this thesis:

• A new kernel-based representation of 3D volumes is introduced from the

inherent Inter-Slice Kernel (ISK) relationship aiming to improve MRI dis-

crimination and highlight brain structure distributions. Specifically, we pro-

pose three different types of ISK-based feature representations to estimate pairwise

MRI similarities using generalized Euclidean metrics. We tune all needed metric parameters by means of a centered alignment approach, so that the obtained kernels resemble the prior demographic information as closely as possible. The proposed approach is tested on MRI data discrimination using patient demographic categories (namely, age and gender). As a result, our proposed discriminative representation proves to be useful for MRI clustering tasks, while properly supporting atlas selection approaches.

• An unsupervised cost function, termed kernel function estimation from information potential variability (KEIPV), is proposed for choosing kernel parameters. The model assumes that a Reproducing Kernel Hilbert Space maximizing the whole information potential variability also highlights the intrinsic data distribution. Therefore, we start from a Parzen-based probability density function estimator and develop the updating rule for the bandwidth of a Gaussian kernel using a finite dataset.

• Renyi’s α entropy is proposed as the cost function of the unified framework

for atlas-based segmentation of brain structures on MRI. This sort of function

leads to more discriminative tissue distributions than the standard maximum likelihood, since the latter relies on the assumption that tissue properties do not overlap significantly, which is far from realistic.

• A voxel-wise feature extraction methodology using linear projections is de-

veloped in a supervised scheme for supporting a patch-based non-local seg-

mentation approach. In this regard, we generalize convolution-based representations

(like gradients, Laplacians, and non-local means) and spatially adapt the feature rep-

resentation, relying on the fact that structure distributions may vary along the image

domain. Linear projections are calculated by maximizing the affinity between label

and feature distributions.

• An entropy operator for matrices is introduced as the cost function for

optimizing a non-linear image representation. Such a representation, based

on multi-layer perceptrons, supports MRI classification tasks while performing as a

dimension reduction for the original image domain.

2 Background

This chapter overviews the technical background on the concepts considered for the devel-

opment of this thesis. Firstly, we provide the mathematical background on the use of kernel

functions for machine learning applications. Added to that, our proposed criterion to es-

timate Gaussian kernel functions from given samples is introduced. Secondly, we formally

define the image segmentation task from the pattern recognition point of view. Finally,

we describe the considered image databases related to brain structure segmentation and

classification. The contents of this chapter are based on the works of Ashburner and Friston [Ashburner and Friston, 2000], Aljabar [Aljabar et al., 2009], Principe [Principe, 2010], Coupe [Coupe et al., 2011], and Alvarez [Alvarez-Meza et al., 2014].

2.1 Reproducing kernel Hilbert spaces in machine learning

The study of positive definite kernels is a topic of wide interest for the machine learning community, since it generalizes a well-developed body of theory for linear models. In this way, a positive definite kernel κ is an implicit way to represent the samples of the input space X. Since there is a correspondence between κ and a reproducing kernel Hilbert space (RKHS) of functions H, the kernel can be understood as an indirect way to compute inner products between elements of a Hilbert space that result from mapping the elements of X to H. So, there is a mapping function ϕ : X → H such that:

$$\kappa(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}. \tag{2-1}$$

Regarding this, the space H can be viewed as a feature space and ϕ is called the feature

map. Consequently, by performing linear operations in H it is possible to perform nonlinear

manipulations in the input space X (see Figure 2-1). As a rule, the dimension of H is very large, possibly infinite (|H|→∞), so that |X| ≪ |H| can be assumed. In practice, there is no need to perform any explicit

computations in H .
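As a minimal illustration of Equation (2-1), the sketch below checks numerically that a kernel evaluation in the input space matches an inner product after an explicit feature map. The homogeneous degree-2 polynomial kernel on 2-D inputs is chosen here only because its feature map is finite and easy to write down; it is an assumption of this example and not a kernel used elsewhere in this chapter.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2, for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_feature_map(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def inner(u, v):
    """Standard Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)                                 # computed in the input space
k_explicit = inner(poly2_feature_map(x), poly2_feature_map(z))  # computed in the feature space
```

Both quantities agree up to rounding, although the first one never builds the feature vectors. For kernels such as the Gaussian, the feature space is infinite dimensional and only the implicit computation is available.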

An important property associated with the use of positive definite kernels in machine learning is the so-called representer theorem [Kimeldorf and Wahba, 1971, Scholkopf and Smola, 2002]:

[Figure 2-1: Kernel-based mapping. The feature map φ takes the samples x and o of the input space to φ(x) and φ(o) in the RKHS.]

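Theorem 2.1.1 Let $\Theta : [0, +\infty) \to \mathbb{R}$ be a strictly monotonic increasing function, $\mathcal{X}$ be a set, and $\epsilon : (\mathcal{X} \times \mathbb{R}^2)^N \to \mathbb{R} \cup \{\infty\}$ be an arbitrary loss function. Then, each minimizer $f \in \mathcal{H}$ of the regularized risk functional:

$$\epsilon\big((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))\big) + \Theta\big(\|f\|_{\mathcal{H}}^2\big), \tag{2-2}$$

admits a representation of the form:

$$f(x) = \sum_{n=1}^{N} \alpha_n \kappa(x_n, x), \tag{2-3}$$

where each $y_n \in \mathbb{R}$ is a given output associated with the input $x_n \in \mathcal{X}$.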

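Proof 2.1.1 Let $S = \mathrm{span}\{\kappa(x_n, \cdot) : x_n \in \mathcal{X}, n \in [1, N]\}$ denote the subspace of $\mathcal{H}$ spanned by the $N$ training samples. Any solution $f \in \mathcal{H}$ can be written as $f = f_S + f_{S^\perp}$, where $f_S \in S$, $f_{S^\perp} \in S^\perp$, and $S^\perp$ stands for the orthogonal complement of $S$. By the reproducing property, $f_{S^\perp}(x_n) = \langle f_{S^\perp}, \kappa(x_n, \cdot) \rangle_{\mathcal{H}} = 0$. Consequently, $f(x_n) = f_S(x_n) + f_{S^\perp}(x_n) = f_S(x_n) + 0$, so the loss term does not depend on $f_{S^\perp}$. Now, for the second term of the regularized risk functional:

$$\Theta\big(\|f\|_{\mathcal{H}}^2\big) = \Theta\big(\|f_S\|_{\mathcal{H}}^2 + \|f_{S^\perp}\|_{\mathcal{H}}^2\big),$$

since $\Theta$ is strictly monotonic increasing, the minimum is achieved for $\|f_{S^\perp}\| = 0$, which implies that $f_{S^\perp} = 0$ and hence $f \in S$.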

With this in mind, the representer theorem basically states that the solution of the minimization of the regularized risk functional can be expressed in terms of the training samples {(xn, yn) : n∈[1, N]}. Therefore, it allows us to deal with problems that at first glance appear to be infinite dimensional. Nonetheless, the regularization does not prevent the existence of multiple local minima; ruling them out requires some extra conditions, namely, convexity.
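The following sketch illustrates the representation in Equation (2-3) for one concrete, convex instance of the regularized risk: a squared loss with the regularizer λ‖f‖², commonly known as kernel ridge regression. Under that assumption the coefficients solve the linear system (K + λI)α = y, where K is the kernel matrix of the training inputs. The 1-D data, the bandwidth, and all function names are hypothetical choices made for this example only.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel between two scalars."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def solve_linear(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (M[i][n] - s) / M[i][i]
    return a


def fit_kernel_ridge(xs, ys, sigma, lam):
    """Coefficients alpha_n of f(x) = sum_n alpha_n k(x_n, x)."""
    n = len(xs)
    A = [[gaussian_kernel(xs[i], xs[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(A, ys)


def predict(x, xs, alphas, sigma):
    """Evaluate the kernel expansion of Equation (2-3) at x."""
    return sum(a * gaussian_kernel(xn, x, sigma) for a, xn in zip(alphas, xs))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, -1.0]
alphas = fit_kernel_ridge(xs, ys, sigma=0.7, lam=1e-6)
fitted = [predict(x, xs, alphas, sigma=0.7) for x in xs]
```

Although the Gaussian RKHS is infinite dimensional, only N = 4 coefficients are estimated. With such a small λ the fitted values nearly interpolate the training outputs.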

2.2 Gaussian kernel estimation from information potential

variability

Let X∈X be a system in the representation space X . Renyi’s entropy, given in Equa-

tion (2-4), quantifies the level of randomness of X .

$$H_\alpha(X) = \frac{1}{1-\alpha} \log\left( \mathbb{E}_x\left\{ p(x)^{\alpha-1} \right\} \right) \tag{2-4}$$

In practice, $p(x)$ can be estimated from a set of $N$ samples $X = \{x_n \in \mathcal{X} : n \in [1, N]\}$ by using Parzen’s nonparametric probability density function estimation:

$$p(x) \approx p_X(x|\sigma_{\mathcal{X}}) = \mathbb{E}_n\{\kappa(x - x_n)\}, \tag{2-5}$$

where $\kappa(\cdot) \in \mathbb{R}^+$ is a symmetric kernel function and the notation $\mathbb{E}_n\{\cdot\}$ stands for the averaging operator along the $N$ samples. Though there are many feasible kernel functions, the Gaussian is commonly preferred because of its universal approximating capability [Liu et al., 2011]. In this case, the Gaussian kernel can be defined for the input domain $\mathcal{X}$ as:

$$\kappa_G(x - x'; \sigma_{\mathcal{X}}) = \exp\left(-\frac{\|x - x'\|_{\mathcal{X}}^2}{2\sigma_{\mathcal{X}}^2}\right), \tag{2-6}$$

where ‖ · ‖X is a given norm in X .

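Provided the observation set $X$ and based on the Parzen’s estimation of Equation (2-5), we get the following estimator of the Renyi’s α-order entropy [Principe, 2010]:

$$H_\alpha(X) \approx H_\alpha(X|\sigma_{\mathcal{X}}) = \frac{1}{1-\alpha} \log\left(V_\alpha(X|\sigma_{\mathcal{X}})\right), \tag{2-7}$$

where the so-termed information potential (IP) $V_\alpha(X|\sigma_{\mathcal{X}})$ of the set $X$ is defined as follows:

$$V_\alpha(X|\sigma_{\mathcal{X}}) = \mathbb{E}_n\{v_\alpha(x_n|\sigma_{\mathcal{X}})\}, \tag{2-8}$$

with $v_\alpha(x_n|\sigma_{\mathcal{X}})$ being the IP of the sample $x_n$, which can be computed as:

$$v_\alpha(x_n|\sigma_{\mathcal{X}}) = \frac{1}{N^{\alpha-1}} \sum_{n'=1}^{N} \left(\kappa_G(x_n - x_{n'}; \sigma_{\mathcal{X}})\right)^{\alpha-1}. \tag{2-9}$$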

Equation (2-9) implies that the IP yields an entropy estimate based on the summation of pairwise sample interactions through the Gaussian kernel function [Morejon and Principe, 2004]. Also, the Information Force (IF), Fn∈X, is defined as the force acting on the particle xn due to all other particles in X and is given by the derivative of the IP with respect to xn.
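As a minimal sketch of Equations (2-7) to (2-9) for the quadratic case α=2, the code below evaluates the per-sample IP, the IP of the whole set, and the resulting entropy estimate. It uses the unnormalized Gaussian kernel of Equation (2-6) with the Euclidean norm; the two toy sample sets and the bandwidth value are assumptions made only for illustration.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Unnormalized Gaussian kernel of Equation (2-6), Euclidean norm."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def sample_ip(n, X, sigma):
    """v_2(x_n | sigma): Equation (2-9) with alpha = 2."""
    return sum(gaussian_kernel(X[n], xm, sigma) for xm in X) / len(X)


def information_potential(X, sigma):
    """V_2(X | sigma): average of the per-sample IPs, Equation (2-8)."""
    return sum(sample_ip(n, X, sigma) for n in range(len(X))) / len(X)


def renyi_quadratic_entropy(X, sigma):
    """Equation (2-7) with alpha = 2, i.e. -log V_2."""
    return -math.log(information_potential(X, sigma))


# Hypothetical toy sets: a compact cluster and a widely spread one.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
spread = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]
h_tight = renyi_quadratic_entropy(tight, sigma=1.0)
h_spread = renyi_quadratic_entropy(spread, sigma=1.0)
```

For a fixed bandwidth, the compact set produces stronger pairwise interactions, hence a larger IP and a lower entropy estimate than the spread set.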

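Particularly, for the case of α=2, the well-known quadratic Renyi’s entropy leads to the following estimation of the IF:

$$F_n = \frac{\partial}{\partial x_n} V_2(X|\sigma_{\mathcal{X}}) = -\frac{1}{N\sigma_{\mathcal{X}}} \sum_{x_{n'} \in X} \kappa_G(x_n - x_{n'}; \sigma_{\mathcal{X}})\,(x_n - x_{n'}) = \mathbb{E}_{n'}\{F(x_n|x_{n'})\} \tag{2-10}$$

$$F(x_n|x_{n'}) = \frac{1}{N\sigma_{\mathcal{X}}^2}\, \kappa_G(x_n - x_{n'}; \sigma_{\mathcal{X}})\,(x_n - x_{n'}), \tag{2-11}$$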

where F (xn|xn′) corresponds to the conditional IF acting on xn due to xn′ . Generally, the IFs

can be interpreted in light of inner products in a high dimensional feature space [Jenssen et al., 2003].

Some important facts have to be highlighted from Equation (2-10):

• Firstly, given that X is fixed and the factor (xn − xn′) points towards xn, all IF directions are also fixed and attractive in nature.

• Secondly, since Fn turns out to be dependent on the free parameter σX , the IP and

all IF magnitudes become functions of the Gaussian kernel bandwidth. In fact, the IP

follows a monotonically decreasing behavior over σX .

• At the same time, the conditional IF magnitude tends to zero as σX goes either to zero or to infinity, reaching its maximum at some value in R+.

Hence, the importance of adequately tuning the Gaussian kernel bandwidth becomes clear. In this sense, we seek an RKHS maximizing the overall IP variability with respect to the kernel bandwidth parameter, so that all IF magnitudes spread as widely as possible on X. To this end, the variability of the estimated IP is maximized in terms of the kernel bandwidth parameter as follows:

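$$\sigma_{\mathcal{X}}^{\star} = \arg\max_{\sigma_{\mathcal{X}}} \mathrm{Var}\{v_2(x|\sigma_{\mathcal{X}})\} \tag{2-12}$$

$$\mathrm{Var}\{v_2(x|\sigma_{\mathcal{X}})\} = \mathbb{E}_x\left\{\left(v_2(x|\sigma_{\mathcal{X}}) - \mathbb{E}_x\{v_2(x|\sigma_{\mathcal{X}})\}\right)^2\right\}. \tag{2-13}$$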

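Differentiating Equation (2-13) with respect to σX, the optimality condition can be rewritten in terms of the above introduced Gaussian-based Renyi’s entropy quantities as follows:

$$\frac{d}{d\sigma_{\mathcal{X}}} \mathrm{Var}\{v_2(x|\sigma_{\mathcal{X}})\} = \frac{2}{N^2\sigma_{\mathcal{X}}^3}\left(1 + \frac{1}{N}\right)\left( \sum_{n,n'=1}^{N} \kappa_G^2(x_n - x_{n'}; \sigma_{\mathcal{X}})\, \|x_n - x_{n'}\|_{\mathcal{X}}^2 - \left(\sum_{n,n'=1}^{N} \kappa_G(x_n - x_{n'}; \sigma_{\mathcal{X}})\right)\left(\sum_{n,n'=1}^{N} \kappa_G(x_n - x_{n'}; \sigma_{\mathcal{X}})\, \|x_n - x_{n'}\|_{\mathcal{X}}^2\right)\right)$$

$$= \frac{2(N^2 + N)}{\sigma_{\mathcal{X}}}\left( \sigma_{\mathcal{X}}^2 \sum_{n,n'=1}^{N} F^2(x_n|x_{n'}) - V_2(X) \sum_{n,n'=1}^{N} \left(F(x_n|x_{n'})\right)^{\top} (x_n - x_{n'}) \right)$$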

Lastly, setting the above derivative to zero, a fixed-point or gradient-based update rule can be employed to find a suitable σX value. As a result, we get a scale updating rule as a function of the IFs, which are induced by a kernel function applied over a finite sample set. Thereby, a Gaussian kernel-based RKHS coding the most spread out IF magnitudes can be estimated using the introduced approach, termed Kernel function Estimation from Information Potential Variability (KEIPV). Figure 2-2 illustrates the influence of the bandwidth on the data representation as well as the result of the KEIPV on a synthetic dataset.
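The criterion of Equation (2-12) can be sketched with a plain grid search over candidate bandwidths. Note that this is a simplified stand-in assumed for illustration: it replaces the fixed-point or gradient-based update rule described above with an exhaustive scan on a logarithmic grid, and the 1-D toy samples and grid limits are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Unnormalized Gaussian kernel between two scalars, as in Equation (2-6)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def sample_ip(n, X, sigma):
    """Per-sample information potential v_2(x_n | sigma)."""
    return sum(gaussian_kernel(X[n], xm, sigma) for xm in X) / len(X)


def ip_variability(X, sigma):
    """Variance over the samples of v_2(x_n | sigma), Equation (2-13)."""
    v = [sample_ip(n, X, sigma) for n in range(len(X))]
    mean = sum(v) / len(v)
    return sum((vi - mean) ** 2 for vi in v) / len(v)


def keipv_grid_search(X, sigmas):
    """Return the candidate bandwidth with the largest IP variability."""
    return max(sigmas, key=lambda s: ip_variability(X, s))


# Hypothetical 1-D samples: a dense group plus two isolated points.
X = [0.0, 0.1, 0.2, 1.0, 3.0]
sigmas = [10.0 ** (e / 4.0) for e in range(-12, 13)]  # 1e-3 ... 1e3
best_sigma = keipv_grid_search(X, sigmas)
```

For very narrow bandwidths every sample only interacts with itself, and for very wide ones all samples interact equally, so the variability vanishes at both extremes and the selected value lies strictly inside the grid.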

2.3 Template-based Image Segmentation

Let an intensity image be a set of spatially arranged real-valued measurements, $X = \{x_r \in \mathbb{R} : r \in \Omega\}$. The bounded domain, $\Omega \subset \mathbb{R}^D$, is defined over a D-dimensional space, and the elements r are commonly known as pixels, for 2D images, or voxels, for 3D volumes.

The image segmentation task consists of partitioning the intensity image into multiple segments, each of them more meaningful and easier to analyze than the original image. We formally define the task as assigning a single label, $l_r \in \mathcal{L}$, to each element of Ω based on its location and intensity (r and xr), aiming to obtain the segmentation image, $L = \{l_r : r \in \Omega\}$, also known as partition or label image. The set $\mathcal{L}$, of size $|\mathcal{L}|$, holds the possible labels.

Template-based segmentation approaches make use of a template (atlas) or set of templates containing a priori information about the shape, position, and/or topology of the imaged structures of interest. Generally, an atlas is described by an intensity image, X, a tissue membership map, $B = \{b_{rl} : r \in \Omega, l \in \mathcal{L}\}$, and a spatial domain, Ω. We recognize two kinds of atlases depending on their membership map definition: deterministic atlases assume a hard membership, i.e., $b_{rl} \in \{0, 1\}$, whereas probabilistic atlases are constrained to $b_{rl} \in [0, 1]$ and $\sum_{l \in \mathcal{L}} b_{rl} = 1$.

The latter kind of atlas is built by spatially normalizing and averaging a set of N anatomical atlases $A = \{X_n, L_n, \Omega_n : n = 1, \ldots, N\}$. Spatial normalization requires a re-parameterization of the domains into a common coordinate space, Ω′, so that each coordinate r indexes the same anatomical position in each image. Formally, this requires a set of functions $\tau_n : \Omega_n \to \Omega'$; $r \mapsto \tau_n(r)$ such that each mapping τn holds the coordinate transformation between the domain Ωn and the canonical configuration. In computational anatomy,

the mappings are $C^n$-smooth, invertible, and topology-preserving transformations, avoiding the emergence of holes. Hence, applying the mappings to the corresponding membership maps allows building the average anatomical template, also known as tissue probability map:

$$B = \left\{ b_{rl} = \frac{1}{N} \sum_{n=1}^{N} (b^n_l * g)_r : r \in \Omega, l \in \mathcal{L} \right\}, \tag{2-14}$$

where g is a convolution function usually introduced for smoothing purposes and ∗ is the spatial convolution operator. In general, τn may consist of an affine transformation (accounting for translation, rotation, scale, and skew) or a non-linear function finely aligning the images [Ashburner, 2007, Avants and Gee, 2004].

Figure 2-2: KEIPV illustrative example. (a) Multivariate Gaussian toy set. (b) Log of the IP variability versus bandwidth. (c)-(e) Gaussian kernel for the toy set and (f)-(h) IFs acting on a fixed particle (green), for narrow (σ=6.13×10^-3), KEIPV (σ=6.13×10^-1), and wide (σ=6.13×10) bandwidth values, respectively.
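Equation (2-14) can be sketched for the simplest setting, under two assumptions made only for this example: the atlases are already aligned to the common space, and the smoothing function g is the identity (no smoothing). Each deterministic label image is then turned into memberships in {0, 1} and averaged over the atlases. The 1-D toy label images and the helper name are hypothetical.

```python
def tissue_probability_map(label_images, labels):
    """Average deterministic memberships over aligned atlases.

    label_images: list of N label images (equal-length sequences of labels).
    Returns one dict per location r mapping each label l to b_rl.
    """
    n_atlases = len(label_images)
    n_voxels = len(label_images[0])
    tpm = []
    for r in range(n_voxels):
        tpm.append({
            l: sum(1 for image in label_images if image[r] == l) / n_atlases
            for l in labels
        })
    return tpm


# Three hypothetical, already aligned 1-D label images.
atlases = [
    ["BG", "GM", "GM", "WM"],
    ["BG", "GM", "WM", "WM"],
    ["BG", "BG", "GM", "WM"],
]
tpm = tissue_probability_map(atlases, labels=("BG", "GM", "WM"))
```

At every location the memberships are non-negative and add up to one, as required for a probabilistic atlas. Locations where the atlases disagree receive fractional memberships.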

Figure 2-3: Schematic illustration of template-based segmentation using multiple atlases. Each anatomical atlas, An, is registered from the atlas space to the query anatomy Xq. The resulting transformation τn is used to map the corresponding tissue membership map Bn to the query space Ωq. Transformed segmentations Bn are combined through the weighting function νnsl to create an estimate of the query segmentation Lq.

In the first stage, the provided atlases are registered to the target image for propagating the labels to the target spatial domain. Afterwards, labels are fused into a single class at each location r. Figure 2-3 illustrates the template-based segmentation procedures, which are outlined as follows:

a) Image registration computes the spatial transformations τn maximizing the alignment between Xn and the query image Xq [Zitova and Flusser, 2003].

b) Label propagation maps each n-th template into the spatial domain of the query image Ωq through the transformation τn, yielding the designated membership set $B_n = \{b^n_{rl} = b^n_{r'l} : r' = \tau_n(r) \in \Omega_q\}$.

c) Voxel classification supplies the estimated label representation Lq by gathering the memberships assigned to a voxel, $b^n_{rl}$, into a single label $l^q_r \in \mathcal{L}$ following the general rule [Wu et al., 2014]:

$$l^q_r = \arg\max_{l} \frac{1}{N} \sum_{n=1}^{N} \sum_{s \in \mathcal{B}_r} \nu^n_{sl}\, b^n_{sl} \quad \text{s.t. } \nu^n_{sl} \geq 0, \tag{2-15}$$

where $\mathcal{B}_r \subset \Omega$ is a spatial neighborhood around r and $\nu^n_{sl} \in \mathbb{R}^+$ is the weighting function.
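The label propagation of step b) can be sketched on a 1-D toy example. The sketch assumes that the registration of step a) has already produced the transformation, here a hypothetical one-voxel translation, and it uses nearest-neighbor lookup with a background label for locations that fall outside the atlas domain; both choices are assumptions of this illustration.

```python
def propagate_labels(atlas_labels, tau, n_query, background="BG"):
    """Pull atlas labels into the query domain.

    For each query location r, the atlas label at r' = tau(r) is read with
    nearest-neighbor rounding; out-of-domain locations get the background.
    """
    propagated = []
    for r in range(n_query):
        r_prime = int(round(tau(r)))
        if 0 <= r_prime < len(atlas_labels):
            propagated.append(atlas_labels[r_prime])
        else:
            propagated.append(background)
    return propagated


atlas = ["BG", "GM", "GM", "WM", "BG"]
# Hypothetical transformation: the atlas content is shifted by one voxel.
shifted = propagate_labels(atlas, lambda r: r - 1, n_query=5)
```

Here `shifted` holds the atlas labels moved one position to the right, with the first location filled by the background label.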

Equation (2-15) makes it possible to include most of the template-based segmentation criteria in a single expression. Some particular cases include:

• Majority voting: $\mathcal{B}_r = \{r\}$ and $\nu^n_{sl}$ = constant.

• Global Weighted Fusion: $\mathcal{B}_r = \{r\}$, and $\nu^n_{sl} = \nu_n = f(X_q, X_n)$ depends on the atlas but not on the location; it is usually constrained to $\sum_{n=1}^{N} \nu_n = 1$.

• Nonlocal means: $\mathcal{B}_r = \{s : \|s - r\|_{\Omega} \leq \beta\}$ is a closed ball of radius β, and $\nu^n_{sl} = f(x^q_r, x^n_s)$ is a function depending on the local affinity. This kind of approach is also known as patch-driven segmentation.

• Parametric Intensity Modeling: $\mathcal{B}_r = \{r\}$, and $\nu^n_{rl} = p(x_r|l_r = l)$ is defined as the tissue conditional probability with a distribution function $f(x_r|\theta_l)$ parameterized by $\theta_l$. In this case, the segmentation rule in Equation (2-15) is rewritten as the maximization of the posterior probability [Ashburner and Friston, 2005]:

$$l^q_r = \arg\max_{l}\; p(l_r = l|x_r) \tag{2-16}$$

$$p(l_r = l|x_r) \propto p(x_r|l_r = l)\, p(l^n_r = l) \tag{2-17}$$

$$p(l^n_r = l) = b_{rl} = \frac{1}{N} \sum_{n=1}^{N} b^n_{rl} \tag{2-18}$$

$$p(x_r|l_r = l) = f(x_r|\theta_l) \tag{2-19}$$

where $p(l^n_r = l) = b_{rl}$ becomes the tissue probability map (TPM) and the parameter set $\theta_l$ is derived from the query image intensities.
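Two of the particular cases of Equation (2-15) can be sketched with a single function on 1-D toy data. The sketch assumes deterministic memberships (each atlas location votes only for its own label), atlases already propagated to the query space, and, for the nonlocal case, a Gaussian affinity between single-voxel intensities with a hypothetical decay parameter `h`; patch-based affinities would follow the same pattern. The constant factor 1/N is omitted because it does not change the arg max.

```python
import math


def fuse_labels(query, atlases, labels, radius=0, h=None):
    """Label fusion following the general rule of Equation (2-15).

    query:   query intensities (1-D sequence).
    atlases: list of (intensities, labels) pairs aligned to the query.
    radius:  half-width of the neighborhood around each location.
    h:       None for constant weights (majority voting when radius=0);
             otherwise the decay of the intensity-based affinity.
    """
    n_voxels = len(query)
    fused = []
    for r in range(n_voxels):
        votes = {l: 0.0 for l in labels}
        for intensities, atlas_labels in atlases:
            for s in range(max(0, r - radius), min(n_voxels, r + radius + 1)):
                if h is None:
                    weight = 1.0
                else:
                    weight = math.exp(-((query[r] - intensities[s]) ** 2) / h)
                votes[atlas_labels[s]] += weight
        fused.append(max(labels, key=lambda l: votes[l]))
    return fused


# Hypothetical data: two atlases place the tissue boundary one voxel too early.
query = [0.1, 0.1, 0.9, 0.9]
atlases = [
    ([0.1, 0.1, 0.9, 0.9], ["GM", "GM", "WM", "WM"]),
    ([0.1, 0.9, 0.9, 0.9], ["GM", "WM", "WM", "WM"]),
    ([0.1, 0.9, 0.9, 0.9], ["GM", "WM", "WM", "WM"]),
]
majority = fuse_labels(query, atlases, labels=("GM", "WM"))
nonlocal_fused = fuse_labels(query, atlases, labels=("GM", "WM"), radius=1, h=0.05)
```

In this toy case majority voting inherits the misplaced boundary from the two misaligned atlases, whereas the intensity-driven weights let neighboring atlas locations with similar intensity dominate the vote at the second voxel.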

2.4 Magnetic resonance image databases

This section describes the magnetic resonance image (MRI) databases considered for the development of this manuscript. Such databases have been widely used for training and evaluating medical image processing methods, and they are publicly available (in some cases freely available upon request). Figure 2-4 exemplifies an image of each collection.

BrainWeb database: The Simulated Brain Database or BrainWeb contains a set of realistic MRI data volumes produced by an MRI simulator. The neuroimaging community evaluates the performance of various image analysis methods on such a database given that the ground-truth is known [Cocosco et al., 1997, Kwan et al., 1996]. The Internet-connected MRI Simulator, at the McConnell Brain Imaging Centre in Montreal, has two publicly available¹ pre-computed MRI sets:

SBD1: 18 simulated images, from the same subject, with dimensions of 181 × 217 × 181 voxels of 1 × 1 × 1 mm. Simulated contrasts are T1-weighted, T2-weighted, and proton density (PD). The T1-weighted image was simulated as a spoiled FLASH sequence, with a 30° flip angle, 18 ms repetition time, and 10 ms echo time. Three different levels of image nonuniformity (0, 40, 100% RF) and six noise levels (0, 1, 3, 5, 7, 9% relative to the brightest tissue in the images) are simulated to attain 18 volumes.

SBD2: 20 simulated images with constant image quality (3% noise, 0% image nonuniformity) and varying anatomy. Each volume was generated based on the anatomical model of a different individual with a normal brain. Provided ground-truth volumes are 362 × 434 × 362-sized and intensity volumes 181 × 256 × 256-sized. Ground-truth is resampled to the corresponding intensity volume resolution aiming to perform voxel-wise comparisons.

For both sets, ground-truth images hold 10 structures, namely, white matter, gray matter, cerebrospinal fluid, skull, bone marrow, dura mater, fat, connective tissue, muscles, and skin. Further details on the simulation process can be found in [Aubert-Broche et al., 2006].

OASIS database: The Open Access Series of Imaging Studies is a project aimed at making

MRI data sets of the brain freely available to the scientific community. Washington Univer-

sity Alzheimer’s Disease Research Center, the Howard Hughes Medical Institute (HHMI) at

Harvard University, the Neuroinformatics Research Group (NRG) at Washington University

School of Medicine, and the Biomedical Informatics Research Network (BIRN) compiled

and made OASIS freely available. The study holds MRI from 416 subjects, aged 18 to 96

years old, including diagnosed very mild dementia (70), mild dementia (28), moderate de-

mentia (2), and healthy (316) subjects [Marcus et al., 2010]. For each subject, three or four

T1-weighted MR scans obtained within a single imaging session are included, from which

a motion-corrected co-registered average image is obtained. Additionally, each subject is

provided with segmented gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) structures. A fourth label (BG) is assigned to voxels with no label to model the image background.

¹ http://brainweb.bic.mni.mcgill.ca/brainweb/

SATA database: The MICCAI 2013 challenge workshop on Segmentation: Algorithms, Theory, and Applications² provides a framework for comparing atlas-based segmentation

methods on three standardized datasets. We are interested in the third challenge for the

brain labeling, in which the diencephalon is labeled into 14 regions of interest: Accum-

bens, amygdala, caudate, hippocampus, pallidum, putamen, and thalamus (in all cases left

and right). This collection contains training and testing sets of 35 and 12 brain T1 MR

scans, respectively. The training set contains both intensity and label images with the 14

annotated structures. The testing set contains only the intensity images. Pairwise nonrigid

alignments, among the training and between training and testing volumes, are also provided

for a competition on standardized registration.

IXI database: The Information eXtraction from Images dataset is a brain imaging study holding MR images from 575 normal subjects aged between 20 and 80 years. Subjects are provided with T1, T2, PD, DTI, and angiogram volumes. All image sequences were obtained with three different scanners (Philips 1.5T, Philips 3T, and GE 3T) at different hospitals in London, and further anonymised and converted to NIFTI format. Additionally, basic demographic information for each subject is included (age, gender, ethnicity, and handedness, among others). The whole dataset is publicly available online³.

ADNI database: The Alzheimer’s Disease Neuroimaging Initiative database⁴ was launched

in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical

Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private

pharmaceutical companies and non-profit organizations. The primary goal of ADNI is to test

whether serial magnetic resonance imaging (MRI), positron emission tomography (PET),

other biological markers, and clinical and neuropsychological assessment can be combined to

measure the progression of mild cognitive impairment and early Alzheimer’s disease. ADNI

database is split into three phases, namely ADNI 1, ADNI 2, and ADNI GO. In all phases

subjects are imaged up to six times using the same scanner: the baseline visit, 6, 12, 18, 24,

and 36 months after baseline. At each visit, subjects are clinically evaluated to label them as

Normal Control (NC), Mild Cognitive Impairment (MCI), Alzheimer’s disease (AD). Clinical

evaluation includes the mini-mental state examination (MMSE), widely used to assess the

mental status, where the maximum score is 30 and a score of 23 points or less is indicative of cognitive impairment [Folstein et al., 1975]. From the three phases, we selected a subset of 633 subjects with scans that had been assigned the “best” quality mark. As a result, the selected cohort holds N=1993 images with the three class labels described above. Besides, 629 images with a “partial” quality mark were selected in order to assess the classification performance under more challenging imaging conditions. Table 2-1 briefly describes the demographic information for the selected ADNI cohort.

² https://masi.vuse.vanderbilt.edu/workshop2013/index.php/Main_Page
³ http://www.brain-development.org/
⁴ adni.loni.usc.edu

                 “best” quality                           “partial” quality
        NC           MCI          AD           NC           MCI          AD
N       655          825          513          465          130          34
Age     74.9 ± 5.0   74.4 ± 7.4   74.0 ± 7.4   76.6 ± 6.4   76.0 ± 6.3   74.3 ± 6.5
Male    47.5%        39.5%        47.6%        70.1%        62.3%        70.6%
MMSE    29.0 ± 1.0   27.1 ± 2.5   21.9 ± 4.4   27.5 ± 2.0   21.2 ± 1.6   14.4 ± 2.8

Table 2-1: Demographic and clinical details of the selected ADNI cohort.


Figure 2-4: Examples of MRI databases. Top to bottom and left to right: SDB1 for three noise levels (1, 5, 9%), SDB2 simulated

structures, OASIS, SATA (overlaying labeled structures), and IXI.


        N      Age (yo)   Male    Structures        Modalities             T1 vol. size                 T1 vox. size
SDB1    1      -          -       10 whole brain    T1, T2, PD             181×217×181                  1×1×1 mm
SDB2    20     29 ± 4     50.0%   10 whole brain    T1                     181×217×181                  1×1×1 mm
OASIS   416    52 ± 25    38.4%   WM, GM, CSF       T1                     208×176×176                  1×1×1 mm
SATA    47     -          -       14 diencephalon   T1                     256×256×287                  1×1×1 mm
IXI     619    49 ± 16    44.7%   -                 T1, T2, PD, MRA, DWI   256×256×130 to 256×256×150   0.93×0.93×1.2 mm, 0.97×0.97×1.2 mm
ADNI    2622   71 ± 7     50.1%   -                 T1                     192×192×160 to 256×256×180   0.93×0.93×1.2 mm to 1.25×1.25×1.2 mm
               (55, 91)

Table 2-2: Summary of characteristics of the considered MRI databases.

3 Kernel-based Template Selection using Embedding Representations

As previously stated, templates provide shape, intensity, and/or functional information regarding imaged structures. For segmentation tasks, a set of templates is either registered to a query image or used to build the prior spatial distribution for each tissue. In either case, using the whole set of templates assumes unimodal shape distributions. Therefore, the resulting solutions may be biased towards anatomically unrepresentative images [Valdes-Hernandez et al., 2009]. In contrast, template selection improves segmentation performance, in terms of computational cost and accuracy, when only representative images from large datasets are propagated [Aljabar et al., 2009].

A new Kernel-based Atlas Image Selection, computed in the Embedding Representation

space (termed KAISER), is introduced in this chapter supporting MRI segmentation. The

approach encodes inter-slice similarities for each volume to keep main shape information

on a lower dimensional representation. Then, a tensor-product kernel properly combines

multiple representations into a single metric. Finally, a spectral decomposition of the dataset

estimates a compact embedding space, for data visualization, where latent data structure is

highlighted.

3.1 Template selection for image segmentation

Let an atlas dataset $\mathcal{A}=\{A_n : n\in[1..N]\}$ be composed of $N$ tuples $A_n=(X_n, B_n, \Omega_n, c_n)$, each holding an intensity image $X_n$, a tissue membership map $B_n$, a spatial domain $\Omega_n$, and a demographic category $c_n\in\mathcal{C}$ (e.g., age, gender, disease). The selection framework ranks the atlas subjects according to their similarity with the query $Q=(X_q, \Omega_q, c_q)$ as follows:

$$\mathcal{A}_q = \{A_n\in\mathcal{A} : s(A_n, Q) \geq \epsilon\}, \tag{3-1}$$

where $s(\cdot,\cdot)\in\mathbb{R}^+$ is an atlas similarity function and $\epsilon\in\mathbb{R}^+$ a predefined threshold value. Then, the top-ranked atlases are aligned to the query to segment it in its native space. Recalling Equation (2-15), template-based segmentation using atlas selection is further written in terms of the weighting value $\nu^n_{sl}$ as:

$$\nu^n_{sl} = \nu^n = \begin{cases} 1, & s(A_n, Q) > \epsilon \\ 0, & \text{otherwise} \end{cases} \tag{3-2}$$

The similarity function $s(\cdot,\cdot)\in\mathbb{R}^+$ can be assessed using either the meta-information or the intensity images. In the first case, the function measures how closely atlas subjects match the query in terms of the demographic variable $c$: $s_C(A_n, Q)=s(c_n, c_q)$. In the second case, the similarity is derived from the intensity images in a standard spatial domain $\Omega$ (e.g., Sum of Squared Differences, Correlation Coefficient Histogram, and Normalized Mutual Information) or from a feature vector $y$ extracted from each image $X$.
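The selection rule of Equations (3-1) and (3-2) can be sketched in a few lines. This is only an illustrative sketch: the function names `select_atlases` and `selection_weights` are hypothetical, and the similarity values are assumed to be precomputed by whichever function $s(\cdot,\cdot)$ is chosen.

```python
import numpy as np

def select_atlases(similarities, eps):
    """Indices of atlases with s(A_n, Q) >= eps, ranked from most to least similar."""
    s = np.asarray(similarities, dtype=float)
    keep = np.flatnonzero(s >= eps)
    # Stable sort on the negated scores ranks the kept atlases in descending order.
    return keep[np.argsort(-s[keep], kind="stable")]

def selection_weights(similarities, eps):
    """Binary weights of Equation (3-2): 1 if s(A_n, Q) > eps, 0 otherwise."""
    return (np.asarray(similarities, dtype=float) > eps).astype(int)
```

Keeping the ranking and the binary weights as separate functions mirrors the two roles of the similarity in the text: ordering the atlases and switching them on or off in the voting.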

3.1.1 Feature extraction based on inter-slice similarities

Let the bounded spatial domain for 3D volumes, $\Omega\subset\mathbb{R}^3$, be indexed by a three-component vector $r = t_a u_a + t_s u_s + t_c u_c$, where $u_v\in\mathbb{R}^3$ is the orthonormal vector along the axis $v$, namely Axial ($v=a$), Sagittal ($v=s$), or Coronal ($v=c$). Since any intensity image $X=\{x_r\in\mathbb{R} : r\in\Omega\}$ is a finite set of measurements indexed by $r$, one may assume that $t_v\in[1..L_v]$, with $L_v$ as the volume size along the axis $v$. In addition, any volume can be expressed as a set of $P$ different non-overlapping partitions:

$$X = \{X_p\in\mathbb{R}^{|\Omega_p|} : p\in[1..P]\}, \qquad X_p = \{x_r\in\mathbb{R} : r\in\Omega_p\} \tag{3-3}$$

$$\text{s.t.}\quad \bigcup_{p=1}^{P}\Omega_p = \Omega, \qquad \Omega_p\cap\Omega_{p'} = \emptyset \;\; \forall p\neq p',$$

where $|\Omega_p|$ is the cardinality of the $p$-th partition (number of voxels). The set $\Omega_p$ is provided as a high-level segmentation or as low-level regions for any spatially normalized anatomic image. Equation (3-3) allows encoding spatial relations by introducing the following similarity metric:

$$y_{pp'} = \kappa_P(X_p, X_{p'}) = \langle\varphi(X_p), \varphi(X_{p'})\rangle_{\mathcal{H}} \in\mathbb{R}^+ \quad \forall p, p'\in[1..P], \tag{3-4}$$

where the function $\varphi:\mathbb{R}^{|\Omega_p|}\to\mathcal{H}$ maps each partition to a Hilbert space $\mathcal{H}$ reproduced by the kernel function $\kappa_P(\cdot,\cdot)$, and $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ stands for the inner product on $\mathcal{H}$.

For the function in Equation (3-4) to model the spatial dynamics of the imaged objects, we split the volumes into ordered slices of size $L_{v'}\times L_{v''}$ that vary smoothly along the canonical axis $v$, defining the partitions as:

$$\Omega^v_p = \{r = p\,u_v + t_{v'}u_{v'} + t_{v''}u_{v''} : t_{v'}\in[1..L_{v'}],\, t_{v''}\in[1..L_{v''}]\} \quad \forall p\in[1..L_v] \tag{3-5}$$

Therefore, applying $\kappa_P(\cdot,\cdot)$ over the above partition yields the Inter-Slice Kernel (ISK) features of the image $X$ along the $v$-th axis:

$$y^v = \{\kappa_P(X^v_p, X^v_{p'})\in\mathbb{R}^+ : p\in[1..L_v],\, p\geq p'\}. \tag{3-6}$$

Note that the symmetry of kernel functions (i.e., $\kappa_P(X_p, X_{p'}) = \kappa_P(X_{p'}, X_p)$) allows considering only $p\geq p'$ when building the ISK. As a result, the similarity function in Equation (3-2) can be assessed in the new feature space:

$$s^v(A_n, Q) = s(y^v_n, y^v_q) \tag{3-7}$$

After spatial normalization, the relations among the images in $\mathcal{A}$ can be encoded into a symmetric positive definite kernel matrix $K^v\in\mathbb{R}^{N\times N}$ by defining the similarity in the ISK space as:

$$s^v(A_n, A_m) = \kappa^v(y^v_n, y^v_m) = \langle\varphi(y^v_n), \varphi(y^v_m)\rangle_{\mathcal{H}}\in\mathbb{R}^+ \quad \forall n,m\in[1..N], \tag{3-8}$$

where $\kappa^v(\cdot,\cdot)$ is a positive definite and infinitely divisible kernel function producing the matrix $K^v$ with elements $\kappa^v_{nm}=\kappa^v(y^v_n, y^v_m)\in\mathbb{R}^+$.

Since the ISK represents the high-dimensional image information along a single axis, different dynamics are highlighted by changing the axis view. Hence, we introduce the Tensor-Product Kernel Representation (TKR) to join the similarity measures in the $K^v$ matrices:

$$s_X(A_n, A_m) = \kappa^T_{nm} = \prod_{v\in\{a,s,c\}} \left(\kappa^v(y^v_n, y^v_m)\right)^{\theta_v}, \tag{3-9}$$

where $\theta_v\in\mathbb{R}^+$ weighs the contribution of $\kappa^v$ to the TKR, and $K^T=\{\kappa^T_{nm}\in\mathbb{R}^+ : n,m\in[1..N]\}$ holds the joint similarity of the dataset $\mathcal{A}$.

Because $\kappa^T_{nm}\to 0$ as soon as any $\kappa^v_{nm}\to 0$, a single view can have a deleterious effect on the TKR. To overcome it, the influence of $K^v$ is decreased by letting $\theta_v\to 0$, so that $(\kappa^v_{nm})^{\theta_v}\to 1$. Besides, the positive definite and infinitely divisible properties of the kernels in Equation (3-8) allow fixing arbitrary powers $\theta_v$, so that the resulting TKR in Equation (3-9) is always positive definite.
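As a minimal sketch of Equation (3-9), the TKR is the element-wise product of the per-axis kernel matrices, each raised to its weight. The function name `tensor_product_kernel` is hypothetical, and the kernel matrices and weights are assumed to be given.

```python
import numpy as np

def tensor_product_kernel(kernels, thetas):
    """Element-wise product of kernel matrices, each raised to a non-negative power."""
    K_T = np.ones_like(np.asarray(kernels[0], dtype=float))
    for K, theta in zip(kernels, thetas):
        # A weight close to zero pushes every entry towards 1, muting that view.
        K_T *= np.power(np.asarray(K, dtype=float), theta)
    return K_T
```

Setting a weight to zero removes the corresponding view from the product, which is the limiting behaviour described in the text.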

3.2 Experiments and Results

In this chapter, we consider the IXI dataset for demographic analyses, and the SDB2, OASIS, and SATA datasets for evaluating the atlas selection performance in different segmentation tasks.


3.2.1 MRI Preprocessing

Three preprocessing steps are carried out over all image data: Images are firstly spatially

normalized to the Talairach space to compare them and extract their features within a

standard space. To this end, rigid registration to the MNI305 atlas is applied to each

volume using the quaternion-based mapping and the mutual information (MI) metric of

the Advanced Normalization Tools (ANTs). As a result, volumes are re-sampled to 197 ×

233 × 189 size (MNI305 template size). Secondly, an intensity normalization is performed

by scaling each voxel value, so that the mean intensity of the white matter is equal for all

images [Fischl et al., 2002]. Such a step is applied by Freesurfer, a freely available image

analysis suite1.

3.2.2 ISK feature extraction

The ISK feature vector of an image $X_n$ is noted as $y^v_n\in\mathbb{R}^{L_v(L_v-1)/2}$. Hence, a new representation space of order $10^4$ is achieved, instead of the original image domain of order $10^6$. The kernel function $\kappa_P$ is chosen as the well-known Gaussian function, noted as follows:

$$\kappa_P(X^n_{p'}, X^n_p) \triangleq \exp\left(-\frac{\|X^n_{p'} - X^n_p\|_F^2}{2\sigma_P^2}\right), \tag{3-10}$$

where the notation $\|\cdot\|_F$ stands for the Frobenius norm and $\sigma_P\in\mathbb{R}^+$ is the kernel bandwidth parameter, which is tuned according to the KEIPV criterion in Equation (2-12). Figure 3-1(a) illustrates the tuning curve for the ISK bandwidth parameter. Since the sampling rates and image dimensions are similar for the three axes, the dynamic range of the inter-slice differences is also the same. As a result, the KEIPV converges near the same bandwidth for all axes. The resulting ISK representations for an MRI of the IXI database using the optimal bandwidth are shown in Figures 3-1(b) to 3-1(d). The red corner patches on the matrices encode MRI regions with no content, i.e., the background. Since the sagittal ISK exhibits symmetry with respect to the anti-diagonal, such a representation properly encodes the head symmetry along the sagittal axis.
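The ISK computation of Equations (3-6) and (3-10) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `isk_features` is hypothetical, the bandwidth is passed in rather than tuned with KEIPV, and the feature vector keeps the diagonal pairs ($p=p'$); using `np.tril_indices(n, k=-1)` instead would drop them and give $L_v(L_v-1)/2$ features.

```python
import numpy as np

def isk_features(volume, axis, sigma):
    """Inter-slice Gaussian kernel matrix and its p >= p' entries along `axis`."""
    vol = np.moveaxis(np.asarray(volume, dtype=float), axis, 0)
    slices = vol.reshape(vol.shape[0], -1)            # one flattened slice per row
    sq_norms = np.sum(slices ** 2, axis=1)
    # Squared Frobenius distance between every pair of slices.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (slices @ slices.T)
    d2 = np.maximum(d2, 0.0)                          # guard against round-off
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    rows, cols = np.tril_indices(K.shape[0])          # index pairs with p >= p'
    return K, K[rows, cols]
```

Calling the function once per axis (0, 1, 2) gives one feature vector per view, which is the input assumed by the tensor-product combination.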

3.2.3 Image similarity function from TKR

Although the latent phenomenon is the same for all ISK, each of them provides a different

view of the data distribution. Hence, aiming to include all the information into a single simi-

larity function, we compute the Tensor-product Kernel Representation using Equation (3-9).

1 http://surfer.nmr.mgh.harvard.edu/


[Figure 3-1. Top panel axes: bandwidth σ (optimum marked as σ⋆P) versus Var(IP). Bottom panels: (b) Axial, (c) Sagittal, (d) Coronal.]

Figure 3-1: Top: Bandwidth vs. KEIPV cost function for the IXI database volumes. Mean and standard deviation values are plotted. Bottom: ISK representation along each view for an IXI subject.


In addition, we propose to fix the $\theta_v$ parameters according to the Information Potential of the corresponding $K^v$, under the assumption that high variability should identify discriminative MRI patterns:

$$\theta_v = \frac{V_2(K^v)}{\sum_{v'\in\{a,s,c\}} V_2(K^{v'})}, \tag{3-11}$$

where the operator $V_2(\cdot)$ measures the Information Potential of a kernel matrix, as defined in Equation (2-8). The resulting parameter setup for the IXI subjects is $\theta_a=0.32$, $\theta_s=0.35$, and $\theta_c=0.33$. Figure 3-2 shows the attained kernel matrices using each ISK representation and the TKR, with the subjects sorted by gender and age. We also show the Sum of Squared Differences (SSD) as the baseline similarity metric. As seen, both categories, gender and age, are highlighted by the kernel representations, evidencing some patterns in the MRI distribution.

[Figure 3-2 panels: (a) ISK Axial, (b) ISK Sagittal, (c) ISK Coronal, (d) SSD, (e) TKR. Both axes list the subjects sorted by gender (Male, Female) and age (20 to 86 and 21 to 80, respectively); colorbar from 0 to 1.]

Figure 3-2: Resulting kernel representations for the IXI database using ISK, TKR and SSD. Colormap is normalized to [0, 1] for all matrices.

To visually identify MRI patterns, we estimate a low-dimensional space from $K^T$ using Kernel Principal Component Analysis (KPCA). Figure 3-3 compares TKR and SSD using their three largest principal components. From the TKR-based projection (Figure 3-3(b)), the following observations arise: i) the third eigenvector is mainly related to gender discrimination; ii) the first and second eigenvectors are nonlinearly related, and both unfold the age category; and iii) older subjects are more widely spread than younger ones. This last finding agrees with anatomical knowledge: brain anatomy is steady in middle-aged humans and changes faster (gray matter volume diminishes) in older humans [Aljabar et al., 2009]. As a result, our proposal naturally highlights the principal subject categories (age and gender), so representing inter-subject relations better than SSD.
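The embedding step can be sketched as a standard KPCA on a precomputed kernel matrix. This is a generic textbook formulation, not necessarily the exact implementation used for Figure 3-3; the function name `kernel_pca` is hypothetical.

```python
import numpy as np

def kernel_pca(K, n_components=3):
    """Embed N samples from an N x N kernel matrix; returns one row per sample."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc = H @ K @ H                             # kernel centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # remove tiny negative round-off
    return eigvecs[:, order] * np.sqrt(lam)
```

Scaling each eigenvector by the square root of its eigenvalue makes the inner products of the embedded points reproduce the centered kernel as closely as the retained components allow.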

[Figure 3-3 panels: (a) SSD-based projection, (b) TKR-based projection. Pairwise plots of the 1-th, 2-th, and 3-th coordinates; legend: gender (M, F) and age scale (30 to 80).]

Figure 3-3: IXI database projections using Kernel Principal Component Analysis from SSD and TKR similarities.

3.2.4 Tissue Labeling Performance

The proposed Kernel-based Atlas Image Selection from the Embedding Representation (KAISER) is evaluated in two tissue labeling tasks. The first one consists of segmenting cerebrospinal fluid, gray matter, and white matter of the SDB2 collection (the query) using OASIS images as templates. To this end, the Diffeomorphic Anatomical Registration using Exponentiated Lie algebra (DARTEL) algorithm aligns the selected subset of images and builds a tissue probability map (TPM) for each query. Then, the resulting TPMs are fed into the unified segmentation tool of Statistical Parametric Mapping (SPM), a probabilistic generative approach that estimates the query labels. The second task aims to segment the 14 diencephalon structures on the training SATA images in a leave-one-out validation. In this case, a weighted majority voting scheme is adopted for segmentation, where the template-to-query TKR similarity weighs the contribution of each atlas. Moreover, the database provides pair-wise non-rigid registrations among training images, so no further alignment is required.


Segmentation performance is measured in terms of the well-known Dice similarity index:

$$DI_l = E_q\left\{\frac{200\,|B^q_l\cap\hat{B}^q_l|}{|B^q_l| + |\hat{B}^q_l|}\right\} \tag{3-12a}$$

$$DI = E_l\{DI_l\}, \tag{3-12b}$$

where $E_q\{\cdot\}$ and $E_l\{\cdot\}$ stand for the averaging operators over the set of query images and over the tissue types, respectively; $B^q_l$ is the binary ground truth for the $l$-th tissue, $\hat{B}^q_l$ its resulting estimation, and $DI_l$ its attained Dice index. If there is no agreement between the ground truth and the attained segmentation, $DI_l$ is minimum (0%); in the case of total agreement, $DI_l$ is maximum (100%). Figure 3-4 compares KAISER against SSD, mutual information (MI), and random (Rand) selection approaches in an incremental search for the optimum number of atlases. The average Dice index and its standard deviation for each tissue and selection approach are reported in Table 3-1, where the number in parentheses indicates the optimal number of selected templates $|\mathcal{A}_q|$. The results show that KAISER not only reaches the maximum accuracy with fewer atlases than the baseline approaches but also selects a subset of images that performs better than larger sets. Particularly for SDB2, the performance measure for all approaches converges to the same value since TPMs are built on a selection-only scheme. On the other hand, SATA results are not the same when using the whole dataset because the weighted voting depends on the similarity measure. In this regard, the tensor kernel representation induces an image similarity that improves segmentation performance.
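For a single query and tissue, the term inside the expectation of Equation (3-12a) can be computed as in the following sketch. The function name `dice_index` is hypothetical, and returning 100 for two empty masks is an assumption, since the text does not cover that case.

```python
import numpy as np

def dice_index(truth, estimate):
    """Dice similarity index, in percent, between two binary masks."""
    t = np.asarray(truth, dtype=bool)
    e = np.asarray(estimate, dtype=bool)
    denom = t.sum() + e.sum()
    if denom == 0:
        return 100.0  # assumption: two empty masks count as total agreement
    return 200.0 * np.logical_and(t, e).sum() / denom
```

Averaging the returned values over queries and then over tissues gives the summary figures of Equations (3-12a) and (3-12b).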

[Figure 3-4 panels: (a) SDB2, (b) SATA. Horizontal axis: number of selected templates |Aq|; vertical axis: DI. Curves: Rand, SSD, MI, KAISER.]

Figure 3-4: Template selection performance for considered image similarities. Average Dice similarity index is depicted along the number of selected templates for both segmentation tasks.


Table 3-1: Template selection performance for each tissue using the optimal number of atlases.

SDB2 database
                Rand (240)    SSD (240)     MI (70)       KAISER (60)
Gray matter     91.9 ± 1.7    92.5 ± 1.7    93.0 ± 1.7    93.2 ± 1.7
White matter    94.8 ± 1.3    95.3 ± 1.4    95.7 ± 1.4    95.9 ± 1.4
CSF             89.0 ± 2.9    88.1 ± 2.8    88.2 ± 3.5    88.5 ± 3.6
Average         91.9 ± 2.0    92.0 ± 2.0    92.3 ± 2.2    92.5 ± 2.2

SATA database
                Rand (34)     SSD (24)      MI (18)       KAISER (13)
Accumbens       78.0 ± 0.9    79.0 ± 1.0    78.7 ± 1.0    79.3 ± 0.8
Amygdala        80.3 ± 0.6    80.8 ± 0.6    80.9 ± 0.6    80.8 ± 0.5
Caudate         82.3 ± 1.3    83.6 ± 1.3    83.9 ± 1.3    86.5 ± 0.8
Hippocampus     83.8 ± 0.6    84.6 ± 0.6    84.9 ± 0.5    85.1 ± 0.5
Pallidum        88.2 ± 0.5    88.5 ± 0.5    88.7 ± 0.5    88.2 ± 0.5
Putamen         92.1 ± 0.3    92.4 ± 0.3    92.6 ± 0.3    92.2 ± 0.3
Thalamus        91.2 ± 0.3    91.6 ± 0.3    91.9 ± 0.3    91.9 ± 0.2
Average         85.1 ± 0.6    85.8 ± 0.7    85.9 ± 0.6    86.3 ± 0.5

3.3 Summary

A kernel-based image representation is introduced to support MRI atlas selection in a

template-based segmentation of brain structures. The proposed approach firstly encodes

smooth MRI inter-slice variations using a kernel function (ISK), which can be related to the

brain structure distribution. Besides, ISK along each canonical axis (Axial, Coronal, and

Sagittal) are further combined into a single Tensor-product Kernel Representation (TKR)

inducing pairwise image similarities.

The computed pairwise image kernel, shown in Figure 3-2 and sorted first by gender and then by age, exhibits a stacked-block-like shape, which suggests that the similarities encode demographic groups. Moreover, the KPCA-based projection in Figure 3-3 shows that TKR highlights the inherent image distribution, particularly age and gender relations. Indeed, TKR enhances both data interpretability and separability with respect to patient demographic information in comparison to the standard sum of squared differences.

The capability of the proposal to support image segmentation is evaluated on two query image collections (the synthetic SDB2 and the SATA for diencephalon) and two segmentation approaches (a probabilistic generative one and weighted majority voting). To this end, the similarity metric embedded in the TKR space (KAISER) is used to select templates and weigh their contribution to the labeling. The results lead to the following conclusions: i) KAISER outperforms other selection strategies such as SSD and MI; ii) there exists a small number of templates performing better than the whole dataset, so avoiding the computational cost of pairwise image registration; and iii) large atlas sets bias the segmentation towards the average population.

4 Information-based cost function for Bayesian MRI segmentation

An information-based cost function is introduced for learning the conditional class probability model required in probabilistic atlas-based brain magnetic resonance image segmentation. Aiming to improve the segmentation results, the α-order Rényi’s entropy is considered as the function to be maximized, since this kind of function has been shown to lead to more discriminative distributions. Additionally, we develop the model parameter updates for the considered function, which lead to a set of weighted averages dependent on the α factor. Our proposal is tested by segmenting the synthetic BrainWeb MRI database and compared against the standard log-likelihood function. Achieved results show an improvement in the segmentation accuracy of ∼5% with respect to the baseline cost function.

4.1 Bayesian Image Segmentation

As seen in Section 2.3, one may write the voxel classification rule in Equation (2-15) as the maximization of the tissue posterior probability [Ashburner and Friston, 2005]:

$$l^q_r = \arg\max_l\; p(l_r = l \mid x_r) \tag{4-1a}$$

$$l^q_r = \arg\max_l\; p(x_r \mid l_r = l)\,p(l_r = l), \tag{4-1b}$$

where $p(l_r=l) = b_{rl} = (1/N)\sum_{n=1}^{N} b^n_{rl}$ is built by averaging a set of spatially normalized label atlases to become the prior Tissue Probability Map (TPM), and $p(x_r \mid l_r=l) = f_{rl}(\theta)$ follows a predefined probability distribution parameterized by $\theta$. In the most common scheme, the set of model parameters $\theta$ is found by maximizing the probability of the whole voxel set (i.e., the volume intensities) under the assumption of independent voxels:

$$P(X) = \prod_{r\in\Omega}\sum_{l\in L} f_{rl}(\theta)\,b_{rl} \tag{4-2}$$

In practice, the negative log-likelihood is used to optimize the parameter set $\theta$ as a smoother equivalent of Equation (4-2):

$$\mathcal{L}(X) = -\sum_{r\in\Omega}\log\left(\sum_{l\in L} f_{rl}(\theta)\,b_{rl}\right) \tag{4-3}$$

Instead of using the common log-likelihood as the cost function, we propose to maximize the amount of information contained in the image $X$. To this end, we consider the α-order Rényi’s entropy as the cost function to be maximized with respect to the set of parameters $\theta$ as follows:

$$\max_\theta H_\alpha(X) \equiv \min_\theta\; \frac{-1}{1-\alpha}\log\left(\int_\Omega p^\alpha(x_r)\right) \tag{4-4}$$
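To make the role of the order α concrete, the following sketch evaluates the Rényi entropy of a discrete distribution, $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$. It is a generic illustration rather than the voxel-wise cost of Equation (4-4); the function name `renyi_entropy` is hypothetical, and the Shannon entropy is returned for α = 1 as the usual limiting case.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha for a discrete probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # zero-probability events do not contribute
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))       # Shannon limit
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))
```

For a non-uniform distribution the value decreases as α grows, and at α = 0 it equals the logarithm of the number of events with non-zero probability.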

4.1.1 Parameter optimization

As commonly assumed, we consider the conditional tissue probability to follow a normal distribution, so that:

$$f_{rl}(\theta) = \gamma_l\,\mathcal{N}(x_r \mid \mu_l,\sigma_l^2) = \frac{\gamma_l}{\sqrt{2\pi\sigma_l^2}}\exp\left(-\frac{|x_r-\mu_l|^2}{2\sigma_l^2}\right), \tag{4-5}$$

where $\mu_l$ and $\sigma_l^2$ are the intensity mean and variance for the class $l$, respectively, and $\gamma_l\in[0,1]$ is the prior probability of any voxel, irrespective of its intensity, to belong to the $l$-th class, subject to $\sum_{l\in L}\gamma_l=1$. Consequently, the parameter set becomes $\theta=\{\gamma_l,\mu_l,\sigma_l^2 : l\in L\}$.

Here, the Expectation-Maximization (EM) algorithm minimizes a given energy function $E$ with respect to the parameters $\theta$ and an introduced distribution $Q=\{q_{rl}\in[0,1] : \forall r\in\Omega, l\in L\}$:

$$-H_\alpha \leq E = -H_\alpha + \sum_{l\in L} D_\alpha\left(q_{rl}\,\|\,p(l_r=l \mid x_r)\right) \tag{4-6}$$

This new energy function works as an upper bound on the proposed cost function and is composed of two terms. The first one only considers the α-order Rényi’s entropy, while the second term, $D_\alpha(\cdot\|\cdot)$, corresponds to the α-order Rényi’s divergence between the posterior probability and the introduced distribution $Q$. Using Equation (4-6), the optimization problem in Equation (4-4) is replaced by:

$$\min_{\theta,Q} E = \frac{-1}{1-\alpha}\log\left(\int_\Omega p^\alpha(x_r)\right) + \sum_{l\in L}\int_\Omega \frac{1}{\alpha-1}\log\left(\frac{q_{rl}^\alpha}{p^{\alpha-1}(l_r=l \mid x_r)}\right) \tag{4-7}$$

EM optimization iteratively updates the parameters by alternating between the following two steps:

E-step: The energy function $E$ is minimized w.r.t. $Q$. Since the first term does not depend on $Q$, the introduced distribution only minimizes the α-divergence, and the problem becomes:

$$\min_Q E \equiv \min_Q \sum_{l\in L}\int_\Omega\frac{1}{\alpha-1}\log\left(\frac{q_{rl}^\alpha}{p^{\alpha-1}(l_r=l \mid x_r)}\right)$$

Given that the α-divergence attains its minimum at $D_\alpha=0$, the solution for $Q$ is:

$$q_{rl} = p^{(\alpha-1)/\alpha}(l_r=l \mid x_r). \tag{4-8}$$

In addition, the property $\sum_{l\in L}p(l_r=l \mid x_r)=1$ yields the following restriction:

$$\sum_{l\in L} q_{rl}^{\alpha/(\alpha-1)} = 1. \tag{4-9}$$
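The E-step update of Equation (4-8) and the restriction of Equation (4-9) can be checked numerically with a short sketch. The function name `e_step` is hypothetical; the posteriors are assumed to be strictly positive and α is assumed to be non-zero, since the exponent $(\alpha-1)/\alpha$ is undefined otherwise.

```python
import numpy as np

def e_step(posterior, alpha):
    """q_rl = p(l_r = l | x_r) ** ((alpha - 1) / alpha), as in Equation (4-8)."""
    post = np.asarray(posterior, dtype=float)   # shape (voxels, classes), rows sum to 1
    return post ** ((alpha - 1.0) / alpha)
```

Raising the result to the power $\alpha/(\alpha-1)$ recovers the posterior, so each row sums to one, which is the restriction in Equation (4-9).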

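M-step: The energy function $E$ is minimized w.r.t. $\theta$. Given the results of the E-step, the Rényi’s divergence is zero whenever the relation in Equation (4-8) holds. Hence, the optimization problem for the M-step, as in Equation (4-4), reduces to minimizing the negative Rényi’s entropy. Using Bayes’ theorem, the properties of the conditional probability, and the results in Equations (4-8) and (4-9), one can introduce the distribution $Q$ into $-H_\alpha(X)$:

$$\begin{aligned}
-H_\alpha(X) &= \frac{-1}{1-\alpha}\log\left(\int_\Omega p^\alpha(x_r)\right) = \frac{-1}{1-\alpha}\log\left(\int_\Omega \frac{p^\alpha(x_r, l_r=l)}{p^\alpha(l_r=l \mid x_r)}\right)\\
&= \frac{-1}{1-\alpha}\log\left(\int_\Omega \frac{p^\alpha(x_r \mid l_r=l)\,p^\alpha(l_r=l)}{q_{rl}^{\alpha^2/(\alpha-1)}}\right)\\
&= \frac{-1}{1-\alpha}\log\left(\int_\Omega \sum_{l\in L} q_{rl}^{\alpha/(\alpha-1)}\,\frac{p^\alpha(x_r \mid l_r=l)\,p^\alpha(l_r=l)}{q_{rl}^{\alpha^2/(\alpha-1)}}\right)\\
-H_\alpha(X) &= \frac{1}{\alpha-1}\log\left(\int_\Omega \sum_{l\in L}\frac{f_{rl}^\alpha\,b_{rl}^\alpha}{q_{rl}^\alpha}\right)
\end{aligned} \tag{4-10}$$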

As a result, the optimization for the M-step is now rewritten as:

$$\min_\theta\; \frac{-1}{1-\alpha}\log\left(\int_\Omega\sum_{l\in L}\frac{f_{rl}^\alpha\,b_{rl}^\alpha}{q_{rl}^\alpha}\right) \tag{4-11}$$

Taking into account that $\alpha\in[0,1)$, minimizing the function in Equation (4-11) is equivalent to maximizing the argument of the log function, known as the Information Potential (IP):

$$\mathcal{V}(X) = \int_\Omega\sum_{l\in L}\left(\frac{f_{rl}(\theta)\,b_{rl}}{q_{rl}}\right)^\alpha$$

Given that we can only measure a finite number of samples in the image, the IP for a given image $X$ is approximated as:

$$\mathcal{V}(X)\approx V(X) = \sum_{r\in\Omega}\sum_{l\in L}\left(\frac{f_{rl}\,b_{rl}}{q_{rl}}\right)^\alpha \tag{4-12}$$

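Finally, the optimal parameter values are found where the derivatives of $V$ with respect to $\theta$ are zero:

$$\frac{dV(X)}{d\theta} = \alpha\sum_{r\in\Omega}\left(\frac{b_{rl}}{q_{rl}}\right)^\alpha f_{rl}^{(\alpha-1)}\,\frac{df_{rl}}{d\theta} = 0$$

By differentiating Equation (4-12) with respect to the mean $\mu_l$, the following expression is attained:

$$\frac{dV(X)}{d\mu_l} = \alpha\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha\frac{(x_r-\mu_l)}{\sigma_l^2}, \tag{4-13}$$

which, solved for $dV/d\mu_l=0$, results in the updating rule:

$$\mu_l^{(k+1)} = \frac{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha x_r}{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha} \tag{4-14}$$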

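Likewise, the derivative of $V$ with respect to the variance parameter $\sigma_l^2$ is:

$$\frac{dV(X)}{d\sigma_l^2} = \frac{\alpha}{2}\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha\left(\frac{(x_r-\mu_l)^2}{\sigma_l^2} - 1\right). \tag{4-15}$$

Hence, the variance is updated in accordance with the following rule:

$$(\sigma_l^2)^{(k+1)} = \frac{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha\left(x_r-\mu_l^{(k+1)}\right)^2}{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha} \tag{4-16}$$

Following the above derivative scheme, the attained updating function for the prior parameter $\gamma_l$ is given by:

$$\gamma_l^{(k+1)} = \frac{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha\mathcal{N}(x_r \mid \mu_l,\sigma_l^2)}{\sum_{r\in\Omega}\left(\frac{b_{rl}f_{rl}}{q_{rl}}\right)^\alpha} \tag{4-17}$$

As a result, the EM algorithm updates $Q$, using Equation (4-8), and $\mu_l$, $\sigma_l^2$, $\gamma_l$, using Equations (4-14), (4-16) and (4-17), alternately until the convergence criteria are met.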
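The weighted-average form of the mean and variance updates in Equations (4-14) and (4-16) can be sketched as below. This is a minimal illustration, not the full EM loop: the function name `m_step` is hypothetical, the arrays `b`, `f`, and `q` are assumed to be given with shape (voxels, classes), and the update of $\gamma_l$ is left out.

```python
import numpy as np

def m_step(x, b, f, q, alpha):
    """Weighted mean and variance per class, with weights (b * f / q) ** alpha."""
    x = np.asarray(x, dtype=float)[:, None]                       # (R, 1) intensities
    w = (np.asarray(b, dtype=float) * np.asarray(f, dtype=float)
         / np.asarray(q, dtype=float)) ** alpha                   # (R, L) voxel weights
    w_sum = w.sum(axis=0)
    mu = (w * x).sum(axis=0) / w_sum                              # Equation (4-14)
    var = (w * (x - mu) ** 2).sum(axis=0) / w_sum                 # Equation (4-16)
    return mu, var
```

When all voxels receive the same weight, for instance as α tends to zero, both updates reduce to the plain sample mean and variance of the intensities.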


4.2 Experiments and Results

The experiments in this chapter are carried out on the SDB1 and SDB2 collections for segmenting

five compartments, namely, white matter (WM), gray matter (GM), cerebrospinal fluid

(CSF), skull (SK), and scalp (SC). Also, the prior probability maps (TPMs) provided by the

SPM software are used to segment the volumes [Ashburner and Friston, 2005]. We assess

the segmentation performance using the Dice Similarity Index defined in Equation (3-12a).

4.2.1 Evaluation of performed segmentation

Firstly, we analyze the influence of the α order on the optimization process. Figure 4-1 depicts the information-based cost function as a function of the number of iterations for several α values. As expected, the entropy orders are related as $H_\alpha(X) < H_{\alpha'}(X)$ for $0\leq\alpha'<\alpha<1$; that is, the larger the entropy order, the smaller its value. Moreover, the EM algorithm converges faster for smaller orders.

[Figure 4-1 axes: number of iterations (0 to 50) versus Hα; one curve per α in 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 0.99.]

Figure 4-1: α-order Rényi’s entropy versus the number of iterations of the optimization procedure, for several α values and a given image in the dataset.

The influence of the entropy order on the segmentation performance is shown in Figure 4-2, which plots the DI versus α for the available noise intensities in SDB1. As seen, the α order drives the accuracy: the Dice index is larger for mid-range orders than for small or large ones. Moreover, the highest segmentation accuracy is generally achieved at α=0.5. It is also worth noting that the algorithm is robust under conventional noise levels. For very large noise levels (9%), the performance markedly decreases, since such levels introduce high variations in the tissue distributions, which hinders the algorithm convergence.

Finally, we compare the attained results against the log-likelihood cost function in Equation (4-3).


[Figure 4-2 axes: entropy order α versus average Dice similarity index; one curve per noise level (1%, 3%, 5%, 7%, 9%).]

Figure 4-2: Average Dice similarity index versus the entropy order for available image noise intensities.

The achieved segmentation accuracy is computed using optimal α=0.5 for each considered

structure. As shown in Table 4-1, the proposed Renyi’s entropy outperforms the baseline

log-likelihood.

           Proposed Entropy                          Baseline Log-likelihood
Noise      1%      3%      5%      7%      9%        1%      3%      5%      7%      9%
Average    85.38   85.22   84.66   84.69   82.78     81.08   80.25   79.94   79.73   77.88
SC         88.61   89.28   86.86   86.92   88.11     84.43   84.97   82.17   82.87   83.15
SK         67.73   67.90   68.40   68.46   68.46     63.16   63.28   64.17   64.26   63.65
CSF        73.03   73.15   71.66   71.57   68.59     68.52   68.60   67.30   67.28   64.35
GM         89.60   88.50   89.73   89.72   84.15     84.88   84.18   85.39   84.89   79.27
WM         90.22   89.12   90.51   90.53   85.77     85.72   85.08   85.85   86.11   80.99

Table 4-1: Dice index for each structure at optimal α = 0.5

4.3 Summary

In this chapter, we have discussed the use of information-based measures in the parameter optimization scheme for MRI segmentation. In particular, we introduce the α-order Rényi’s entropy as a new cost function for finding the tissue distribution parameters under the assumption of normally distributed classes. Additionally, we have developed the updating equations for an EM-based optimization using the considered function. As a result, parameters are updated from weighted averages (see Equations (4-14), (4-16) and (4-17)), where the influence of the $r$-th voxel on each parameter is given by $(b_{rl}f_{rl}/q_{rl})^\alpha$.


Figure 4-1 illustrates the relationship between different entropy orders: in the range [0, 1], the larger the order, the smaller the information measure. In fact, the maximum possible value for the Rényi’s entropy is achieved when α = 0, corresponding to $H_0 = -\log(1/|\Omega|)$. We also find a direct relationship between the order and the number of iterations required for convergence. The above is due to the influence of α on the probability values: the entropy tends to weigh all the events (the voxels) more evenly as α tends to zero, regardless of their probability, i.e., $(b_{rl}f_{rl}/q_{rl})^\alpha\to 1$. On the other hand, for large α values, the entropy is determined by the most probable events.

Regarding the segmentation results in Figure 4-2, we obtain the maximum performance at

α=0.5. For such a value, we compare the proposed cost function against the log-likelihood

as the baseline approach. Achieved Dice indexes, in Table 4-1, show that our scheme out-

performs the baseline since the obtained parameters for the entropy function are more dis-

criminative than those for the log-likelihood.

5 Multi-atlas label fusion using supervised local weighting

This chapter introduces a multi-atlas weighted label fusion approach that exploits supervised information from the labels being fused to improve the segmentation accuracy of brain MR images. Namely, we employ knowledge about the neighborhood as well as the patch structure to be segmented. To this end, we assume a voxel-wise feature extraction procedure based on a spatially-varying linear combination of patch intensities (like gradients, Laplacians, and non-local means). The parameters of such a linear combination are locally computed in a supervised learning scheme by maximizing the match between the local labels and the extracted features, aiming to attain a more discriminating voxel representation. Particularly, we make use of the centered kernel alignment (CKA) criterion, which assesses the correlation between a pair of kernel matrices [Cortes et al., 2012]. Then, we benefit from the neighborhood-wise analysis by providing more information about the local tissue structure and reducing the influence of small target-atlas registration issues.

5.1 Feature-based label fusion within α-neighborhoods

From a pattern recognition point of view, label fusion builds a set of discriminative functions, noted as $G=\{g_l(r):\Omega\to\mathbb{R}^+, \forall l\in L\}$, scoring the membership of the $r$-th voxel to the $l$-th class, so that the larger the score, the more likely the given voxel belongs to that class. Consequently, each voxel label is attained as:

$$l^q_r = \arg\max_{l\in L}\; g_l(r) \tag{5-1}$$

In practice, registration issues, imaging artifacts, and intricate shape structures degrade the

label estimation. To further increase the segmentation performance, discriminating functions

44 5 Multi-atlas label fusion using supervised local weighting

are written as a majority voting scheme depending on neighboring voxels:

gl(r) =1

N

N∑

n=1

∑

s∈Br

νns b

nsl (5-2)

νns = ν (yq

r ,yns ) ; ∀s∈Br, n∈[1, N ], (5-3)

being $y^n_s\in Y$ a feature vector, extracted at location s from the n-th intensity image, and $B_r=\{s\in\Omega^q:\|s-r\|_\Omega\le\alpha\}$ a neighborhood centered at r, with a radius $\alpha\in\mathbb{R}^+$. The scalar $\nu^n_s\in\mathbb{R}^+$ measures the similarity between voxels r and s in the feature space Y through the function ν(·, ·). Notation $\|\cdot\|_\Omega$ stands for the norm defined on the spatial domain Ω, and $\|\cdot\|_2$ for the L2-norm.

We then rewrite the discriminating functions in terms of the weighting factors as $g_l(r)=\nu_r^\top d_{rl}$, where $\nu_r=\{\nu^n_s\in\mathbb{R}^+: s\in B_r,\ n\in[1..N]\}$, with $\nu_r\in\mathbb{R}^S$, holds the weights of the S labeled voxels contributing to the segmentation of the query image at location r ($S\le N|B_r|$), and the vector $d_{rl}\in\{0,1\}^S$ comprises the votes for the l-th tissue as $d_{rl}=\{b^n_{sl}: s\in B_r,\ n\in[1..N]\}$. Consequently, the segmentation problem stated in Equation (5-1) becomes:

$$l^q_r = \arg\max_{l\in L}\ \nu_r^\top d_{rl}, \tag{5-4}$$

and the weighting factors in Equation (5-4) now depend on the similarity to the query voxel

r in the feature space Y.
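As an illustration of Equation (5-4), the following minimal sketch fuses the labels voted by S atlas voxels at a single query location. The function name `fuse_labels`, the integer label encoding, and the toy weights are assumptions made for this example only; they are not part of the original method description.

```python
import numpy as np

def fuse_labels(weights, atlas_labels, n_classes):
    """Weighted voting at one query voxel, following Equation (5-4).

    weights:      (S,) non-negative similarities, one per contributing atlas voxel.
    atlas_labels: (S,) integer labels in [0, n_classes) of the same atlas voxels.
    """
    # votes[s, l] = 1 when atlas voxel s votes for class l (the vectors d_rl, stacked over l).
    votes = np.eye(n_classes)[atlas_labels]
    # scores[l] = nu_r^T d_rl, i.e. the discriminative function g_l(r) up to a constant factor.
    scores = weights @ votes
    return int(np.argmax(scores)), scores

# Toy example (hypothetical values): four atlas voxels voting among three classes.
weights = np.array([0.9, 0.2, 0.4, 0.1])
atlas_labels = np.array([1, 0, 1, 2])
label, scores = fuse_labels(weights, atlas_labels, n_classes=3)
```

In this toy case class 1 accumulates the largest weight, so it is the label assigned to the query voxel.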

5.1.1 Supervised feature learning based on centered kernel alignment

Here, we propose to compute the feature vectors in Equation (5-3) using the following projection:

$$y^n_s = W_r x^n_s, \tag{5-5}$$

where $x^n_s=\{x^n_t:\|t-s\|_\Omega\le\beta,\ t\in\Omega\}$ denotes the vector containing the $p\in\mathbb{Z}^+$ voxel intensities within a patch centered at s with radius $\beta\in\mathbb{R}^+$, termed the β-patch, and the h×p-sized matrix $W_r=\{w_{ut}\in\mathbb{R}: u=[1..h],\ t=[1..p]\}$ linearly projects the β-patch to the h-dimensional feature space Y. Therefore, we may assess the space-varying weights in Equation (5-3) using a Gaussian kernel as below:

$$\nu^n_s(W_r)=\exp\left(-\frac{\|W_r x^q_r - W_r x^n_s\|_2^2}{2\sigma^2}\right);\quad \forall s\in B_r \tag{5-6}$$

In the case of Wr=W (that is, a single projection shared by all locations), the projection matrix can be seen as a set of h 3D convolution masks with radius β, e.g., averaging, Laplacian, and gradient.

For the resulting feature space to improve tissue discrimination, the joint information be-

tween the feature vectors and their corresponding labels should be maximal. A practical

estimation of such a joint information is given by the centered kernel alignment (CKA)

score [Cortes et al., 2012]:

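$$\rho(Y,B)=\frac{E_{yy',bb'}\left\{\bar\kappa_Y\left(y^n_s,y^{n'}_{s'}\right)\bar\kappa_B\left(b^n_s,b^{n'}_{s'}\right)\right\}}{\sqrt{E_{yy'}\left\{\bar\kappa^2_Y\left(y^n_s,y^{n'}_{s'}\right)\right\}E_{bb'}\left\{\bar\kappa^2_B\left(b^n_s,b^{n'}_{s'}\right)\right\}}}, \tag{5-7}$$

where $\bar\kappa(\cdot,\cdot)$ is the centered version of the kernel function κ(·, ·), given by:

$$\bar\kappa(z,z')=\kappa(z,z')-E_{z'}\{\kappa(z,z')\}-E_{z}\{\kappa(z,z')\}+E_{zz'}\{\kappa(z,z')\}. \tag{5-8}$$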

Therefore, ρ∈[0, 1] is an estimate of the statistical dependence between feature and label

spaces (Y and L) so that the more similar the pairs between interspace variables, the larger

the ρ score.

Here, the function $\kappa_B(b^n_s,b^{n'}_{s'})=\delta(\|b^n_s-b^{n'}_{s'}\|_2)$ defines the label agreement between any pair of voxels, and δ(·) is the delta function. In addition, the pairwise feature similarity can be computed using Equation (5-6) as $\kappa_Y(y^n_s,y^{n'}_{s'})=\exp(-\|W_rx^{n'}_{s'}-W_rx^n_s\|_2^2/(2\sigma^2))$. Finally, we enclose all the feature and label similarities in the following kernel matrices:

$$K_Y(W_r,\sigma)=\left\{\kappa_Y\left(y^n_s,y^{n'}_{s'}\right): s,s'\in B_r;\ n,n'\in[1,N]\right\} \tag{5-9a}$$

$$K_L=\left\{\kappa_B\left(b^n_s,b^{n'}_{s'}\right): s,s'\in B_r;\ n,n'\in[1,N]\right\} \tag{5-9b}$$

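Both kernel matrices, $K_Y\in[0,1]^{S\times S}$ and $K_L\in[0,1]^{S\times S}$, allow the CKA to be empirically estimated in accordance with [Brockmeier et al., 2014]:

$$\rho(W_r,\sigma)=\frac{\langle\bar K_Y,\bar K_L\rangle_F}{\sqrt{\langle\bar K_Y,\bar K_Y\rangle_F\,\langle\bar K_L,\bar K_L\rangle_F}}, \tag{5-10}$$

where $\langle\cdot,\cdot\rangle_F$ stands for the matrix-based Frobenius inner product, so that $\langle K,K\rangle_F=\|K\|_F^2$ is the squared Frobenius norm. Notation $\bar K$ corresponds to the centered version of the kernel matrix K, calculated as $\bar K=HKH$, where $H=I-\mathbf{1}\mathbf{1}^\top/S$ is the empirical centering matrix, $I\in\mathbb{R}^{S\times S}$ the identity matrix, and $\mathbf{1}\in\mathbb{R}^S$ an all-ones vector.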

Thus, the maximization of Equation (5-10) with respect to Wr generates weighting factors

enhancing the label discrimination. Accordingly, the use of the introduced supervised feature

space into the majority voting scheme is designated as Centered Kernel Alignment-based

Label Fusion (CKA-LF ).
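The empirical alignment of Equation (5-10) can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the random toy patches, the sizes (40 samples, p=27, h=5), the bandwidth value, and all function names are hypothetical and chosen for the example.

```python
import numpy as np

def gaussian_kernel(Y, sigma):
    """Pairwise Gaussian similarities between the rows of Y, as in Equation (5-6)."""
    sq = np.sum(Y**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def delta_kernel(labels):
    """Label agreement kernel: 1 when two voxels share the label, 0 otherwise."""
    return (labels[:, None] == labels[None, :]).astype(float)

def cka(K, L):
    """Empirical centered kernel alignment between two kernel matrices."""
    S = K.shape[0]
    H = np.eye(S) - np.ones((S, S)) / S      # empirical centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    # Frobenius inner products computed as sums of element-wise products.
    return np.sum(Kc * Lc) / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 27))                # 40 toy patches with p = 27 intensities
labels = rng.integers(0, 3, size=40)         # toy tissue labels
W = rng.normal(size=(5, 27))                 # hypothetical projection with h = 5
rho = cka(gaussian_kernel(X @ W.T, sigma=10.0), delta_kernel(labels))
```

Because both kernels are positive semidefinite, the returned score lies in [0, 1], and aligning a kernel with itself gives 1.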


5.1.2 CKA-LF optimization using gradient descent

The explicit objective function of the empirical CKA in Equation (5-10) yields:

$$F(W_r,\sigma) = -\log\left(\mathrm{tr}\left(K_Y(W_r,\sigma)HK_LH\right)\right) + \frac{1}{2}\log\left(\mathrm{tr}\left(K_Y(W_r,\sigma)HK_Y(W_r,\sigma)H\right)\right)+\rho_0, \tag{5-11}$$

where ρ0∈R is a scalar independent of Wr. Since F is the negative logarithm of the CKA, minimizing F maximizes the alignment. A two-variable optimization problem then holds, so that both the projection matrix and the Gaussian kernel bandwidth σ are optimized. We estimate each parameter alternately in an iterative approach: fixing σ, the gradient descent optimization updates Wr using the derivative of the objective function in Equation (5-11) with respect to Wr, given by:

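$$\frac{\partial F(W_r)}{\partial W_r} = -4X_r^\top\left((G\circ K_Y)-\mathrm{diag}\left(\mathbf{1}^\top(G\circ K_Y)\right)\right)Y_r, \tag{5-12}$$

with diag(·) and ∘ denoting the diagonal operator and the Hadamard product, respectively. The matrices $X_r\in\mathbb{R}^{S\times p}$ and $Y_r\in\mathbb{R}^{S\times h}$ are built from neighboring voxels as $X_r=\{x^n_s: s\in B_r,\ n\in[1..N]\}$ and $Y_r=\{y^n_s=W_rx^n_s: s\in B_r,\ n\in[1..N]\}$. $G\in\mathbb{R}^{S\times S}$ is the gradient of the objective function with respect to $K_Y$, computed as:

$$G=\frac{dF}{dK_Y}=\frac{HK_LH}{\mathrm{tr}(K_YHK_LH)}-\frac{HK_YH}{\mathrm{tr}(K_YHK_YH)}. \tag{5-13}$$

Wr is then updated following the rule:

$$W_r^{t+1}=W_r^{t}-\mu_t\frac{\partial F(W_r)}{\partial W_r^{t}}, \tag{5-14}$$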

being $\mu_t\in\mathbb{R}^+$ and $W_r^t$ the learning rate and the estimated projection at iteration t, respectively. Then, fixing Wr, we set up the bandwidth. Such a parameter scales all pairwise distances on the projected space $Y_t$, so we estimate $\sigma_t$ through the Kernel Function Estimation from Information Potential Variability (KEIPV) criterion introduced in Section 2.2. Thus, we maximize the overall variability of the so-termed information potential of the projected samples Yr with respect to the kernel bandwidth parameter, spreading the magnitude of the information forces more widely.
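The projection update for a fixed bandwidth can be sketched as follows. This is not the closed-form gradient of Equation (5-12): as a deliberately simpler stand-in, the snippet uses a central finite-difference gradient of the objective in Equation (5-11) and a backtracking step size, both of which are assumptions of this example. The bandwidth is kept fixed (the KEIPV step is not implemented), and the toy data and function names are hypothetical.

```python
import numpy as np

def objective(W, X, L, sigma):
    """F(W, sigma) of Equation (5-11) without the constant term rho_0."""
    Y = X @ W.T
    sq = np.sum(Y**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma**2))
    S = K.shape[0]
    H = np.eye(S) - np.ones((S, S)) / S
    KH = K @ H
    return -np.log(np.trace(KH @ L @ H)) + 0.5 * np.log(np.trace(KH @ KH))

def numerical_gradient(f, W, eps=1e-5):
    """Central finite differences (stand-in for the analytical gradient)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2.0 * eps)
    return grad

def fit_projection(X, labels, h, sigma, lr=0.1, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(h, X.shape[1]))
    L = (labels[:, None] == labels[None, :]).astype(float)   # label kernel
    f = lambda M: objective(M, X, L, sigma)
    history = [f(W)]
    for _ in range(n_iter):
        g = numerical_gradient(f, W)
        step = lr
        # Backtracking (assumption): halve the step until the objective does not increase.
        while step > 1e-8 and f(W - step * g) > history[-1]:
            step *= 0.5
        if step <= 1e-8:
            break
        W = W - step * g                      # descent update as in Equation (5-14)
        history.append(f(W))
    return W, history

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 15)
X = rng.normal(scale=0.5, size=(30, 4))
X[:, 0] += 2.0 * labels                       # only the first intensity is informative
W, history = fit_projection(X, labels, h=2, sigma=1.0)
```

The recorded `history` holds the objective value at every accepted step, so it can be used to monitor convergence before the bandwidth is re-estimated.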

5.2 Experimental Setup

Since the contribution of the current chapter centers on the label fusion stage, atlas selection is carried out using the KAISER approach proposed in Chapter 3. Then, patch-based segmentation of a query image follows the procedure in Figure 5-1: i) A subset of S voxels is extracted from the fixed α-neighborhood of the atlas set. Each voxel is described by its label and its β-patch (red patches in the Input Atlases block); ii) The linear mapping, Wr, is then computed from those samples by minimizing Equation (5-11) (Mapping Learning block); iii) Query patches are extracted from the target image for all voxels belonging to the α-neighborhood (blue patches in the Input Query block). Next, all feature vectors are computed using the previously learned projection Wr; iv) Voting weights are estimated using the similarity function in the feature space defined in Equation (5-3); lastly, v) a majority voting is carried out at each query voxel, taking into account the voting weights according to Equation (5-4).

Figure 5-1: Proposed multi-atlas patch-based label fusion scheme. Blocks: Input Atlases (patches xn), Input Query (patches xq), Mapping Learning (Wr), Patch Mapping (features yn and yq), and Label Fusion.
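Steps i) and iii) both rely on collecting the β-patches of every voxel inside an α-neighborhood. A minimal sketch of that extraction is given below; it assumes cubic neighborhoods and patches (maximum norm), edge padding at the volume border, and a center lying at least α voxels away from the border. None of these choices is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def extract_patches(volume, center, alpha, beta):
    """Return the flattened beta-patches of all voxels in the alpha-neighborhood of `center`.

    Assumes cubic neighborhoods and that `center` is at least `alpha` voxels from the border.
    """
    # Padding by beta lets patches centered near the border be extracted with plain slicing.
    padded = np.pad(volume, beta, mode="edge")
    offsets = range(-alpha, alpha + 1)
    patches = []
    for dz in offsets:
        for dy in offsets:
            for dx in offsets:
                z, y, x = center[0] + dz, center[1] + dy, center[2] + dx
                # Indices shift by +beta in the padded volume, so this slice is centered at (z, y, x).
                patch = padded[z:z + 2 * beta + 1, y:y + 2 * beta + 1, x:x + 2 * beta + 1]
                patches.append(patch.ravel())
    return np.stack(patches)

volume = np.arange(7 * 7 * 7, dtype=float).reshape(7, 7, 7)
patches = extract_patches(volume, center=(3, 3, 3), alpha=1, beta=1)
```

With α=1 and β=1 the result has 27 rows (voxels in the neighborhood) and 27 columns (intensities per patch); the rows can be stacked over atlases to build the sample matrix used for learning Wr.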

5.2.1 Algorithm parameter setup

Firstly, we adjust the sizes of the neighborhood and the patch, which are the most critical parameters of the proposed CKA-LF approach. Describing voxels by their β-patches involves the following trade-offs: the larger the patch, the higher the computational cost; besides, the more remote the elements, the lower their relevance. With the above in mind, parameter setup is carried out through an exhaustive search over the ranges α, β ∈ {1, 2, 3}, as illustrated in Figure 5-2. We also compare two well-known patch-based approaches, namely, patch similarity and regression-based voting. The exhaustive search results in the following optimal parameter setup (α, β): (3, 1) for similarity, (2, 1) for regression, and (2, 2) for the proposed CKA-LF.

Figure 5-2: Parameter tuning for patch-based approaches by exhaustive search. Values reported in each panel (rows: β radius; columns: α radius):

(a) Similarity
         α=1    α=2    α=3
β=1     86.9   87.1   87.2
β=2     86.6   87.0   87.2
β=3     86.0   86.9   87.2

(b) Regression
         α=1    α=2    α=3
β=1     87.4   87.4   87.2
β=2     87.1   87.4   87.2
β=3     86.7   87.3   87.2

(c) CKA-LF
         α=1    α=2    α=3
β=1     88.0   87.2   86.4
β=2     88.4   88.4   88.1
β=3     88.3   88.0   87.9

Examples of the label similarity matrix, the non-projected weights (i.e., Wr=I), and the supervised weights (after the CKA-based projection) are provided in Figures 5.3(a) to 5.3(c), respectively. All values are computed for the optimal patch radius over the above sample subset. For the sake of easier visualization, the voxels are sorted with respect to their tissue label. As noted in Figure 5-3, the supervised weights discriminate tissues better than the similarity assessed directly on the patches.

Figure 5-3: Resulting kernel matrices for a random subset of voxels before and after learning the projection matrix Wr. Voxels are sorted by tissue type (BG, Acc, Amy, Caud, Hippo, Pall, Put, Thal). Panels: (a) Reference kernel KL; (b) Patch-based kernel KX; (c) Feature-based kernel KY.

On the other hand, the influence of the β-patch radius on the resulting labeling is illustrated in Figure 5-4 for a given region of interest, where the mislabeled voxels are marked in red. As seen, β=0 performs the worst (see Figure 5.4(a)) since there is no patch information to compute the needed projection. As detailed in the subplots of Figure 5-4, the label fusion becomes more accurate as the patch radius grows up to β=2 (DI rising from 86.8 to 89.0), which shows the benefit of incorporating spatial information into the voxel representation. However, the segmentation accuracy decreases for excessively large patches (see Figure 5.4(d)) because of the following two convergence issues: firstly, the size of the projection matrix grows geometrically, complicating the convergence of the CKA to a suitable maximum. Secondly, the larger the radius, the more complex the distribution of the patch vectors. Hence, the proposed linear projection is not able to find a weight function properly encoding the supervised information.

5.2.2 Patch-based segmentation performance

To compare a pair of segmentations, we consider the Dice similarity index (DI) previously defined in Equation (3-12b). The segmentation performance achieved by the examined methods is shown in Table 5-1 as the average and standard deviation values of DI. Template selection results from Table 3-1 are also included to evaluate the benefit of patch-based approaches. Although the proposed CKA-LF generally outperforms the compared approaches, the Nucleus Accumbens and Pallidum structures are better segmented by the regression scheme. It is important to highlight that the best Amygdala extraction is attained by the voxel-wise weighted majority voting (the baseline approach). Such a structure is anatomically connected to the hippocampus, and their contrast in an MRI is low. Consequently, we consider that, in this particular case, including neighboring information reduces the segmentation performance.
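For reference, a per-structure Dice index can be computed as sketched below. Equation (3-12b) lies outside this chapter, so the snippet assumes the usual definition, twice the overlap divided by the sum of both region sizes; the function name and the toy label arrays are hypothetical.

```python
import numpy as np

def dice_index(seg_a, seg_b, label):
    """Dice overlap of the region carrying `label` in two label maps of equal shape."""
    a = (seg_a == label)
    b = (seg_b == label)
    denom = a.sum() + b.sum()
    # Convention (assumption): two empty regions are treated as a perfect match.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

# Toy label maps (flattened); values are structure identifiers.
estimated = np.array([0, 1, 1, 2, 2, 2])
reference = np.array([0, 1, 2, 2, 2, 0])
di_structure_2 = dice_index(estimated, reference, label=2)
di_structure_1 = dice_index(estimated, reference, label=1)
```

Multiplying the result by 100 gives values on the same scale as those reported in Table 5-1.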

Structure      Voting        Similarity    Regression    CKA-LF
Accumbens      80.8 ± 0.5    80.8 ± 0.7    81.2 ± 0.6    80.3 ± 0.8
Amygdala       86.5 ± 0.8    82.9 ± 0.4    83.1 ± 0.4    82.9 ± 0.5
Caudate        85.1 ± 0.5    90.6 ± 0.5    91.3 ± 0.4    92.5 ± 0.6
Hippocampus    85.1 ± 0.5    87.0 ± 0.3    87.6 ± 0.3    89.3 ± 0.4
Pallidum       88.2 ± 0.5    88.5 ± 0.4    88.7 ± 0.4    88.4 ± 0.6
Putamen        92.2 ± 0.3    92.3 ± 0.3    92.5 ± 0.3    94.0 ± 0.5
Thalamus       91.9 ± 0.2    92.7 ± 0.2    92.8 ± 0.2    95.0 ± 0.2
Average        86.3 ± 0.5    87.8 ± 0.4    88.2 ± 0.4    88.9 ± 0.5

Table 5-1: Dice index scores for considered approaches and structures. Mean and standard deviation along the subjects are depicted.


Figure 5-4: β radius effect in a subject’s volume. Mislabelings are plotted in red for the axial (left), sagittal (center) and coronal (right) axes. Panels: (a) β = 0, DI = 86.8; (b) β = 1, DI = 87.8; (c) β = 2, DI = 89.0; (d) β = 3, DI = 88.8.

5.3 Summary

From the validation carried out above, the following aspects emerge as relevant in developing the CKA-LF method:

Firstly, the voting function measures the similarity between feature vectors extracted through the linear mapping of the patch representation. Such a mapping is accomplished by the projection matrix, which is learned in a supervised scheme so that the feature and label relations resemble each other the most. To this end, we maximize the centered kernel alignment criterion, which estimates the affinity between pairs of similarity matrices. Thus, the use of the CKA criterion and the linear projection of the patches (see Figure 5-3) allows building a voting function that becomes highly related to the tissue labels, increasing the class discrimination.

The second aspect is the tuning of the patch and neighborhood sizes, which are the parameters having a strong influence on the estimation of the mapping function. Specifically, the former parameter determines the projection domain and the intensity variability inside each patch. At the same time, the latter establishes the number of available samples for estimating the projection matrix as well as the shape changes within the neighborhood, which allows coping with small registration issues.

Therefore, the lack of patch information in the computed projection decreases the achievable accuracy of label fusion when β→0, as seen in Figure 5.4(a). Nonetheless, the performance decreases again for very large patches (see Figure 5.4(d)) due to the geometrically growing size of the resulting projection matrix and the complex distribution of the patch vectors. For this reason, the algorithm cannot produce a projection matrix that suitably aligns features and labels, and hence, the resulting weight functions do not encode the supervised information properly. We also investigate the effect of the search neighborhood size α on the voting function estimation. Although small neighborhood radii should provide more robustness to low-frequency artifacts, they lead to a lack of patches and a poorly estimated Wr. By contrast, large values of α produce a large number of patches and, therefore, the shape variability increases, so that the projection matrix calculation becomes more complex and yields suboptimal solutions.

6 Magnetic resonance image classification using kernel-enhanced neural networks

Several computer-aided dementia diagnosis methods have been proposed to discriminate between patients with Alzheimer’s disease (AD), mild cognitive impairment (MCI), and healthy controls (NC) given their MRI scans. Nonetheless, the problem is particularly challenging because of the heterogeneous and intermediate nature of MCI. To cope with this issue, we benefit from the advantages of artificial neural networks (ANN) for complex classification tasks and introduce a novel supervised pre-training approach devoted to automated dementia diagnosis. The proposal initializes an ANN based on linear projections to achieve more discriminating spaces. Such projections are computed by maximizing the centered kernel alignment criterion, which assesses the affinity between the data kernel matrix and the label target matrix. As a result, the linear embedding allows accounting for the features that contribute the most to class discrimination. We contrast the proposed approach against two unsupervised initialization approaches (autoencoders and Principal Component Analysis), and against the four best-performing classification methodologies from the 2014 CADDementia challenge. As a result, our proposal outperforms all the baselines (by about 7 percentage points in classification accuracy and area under the ROC curve) while also reducing the class bias.

6.1 Multi-layer perceptron-based classifier using kernels

Within the classification framework, a multi-layer perceptron (MLP) with L layers is assumed to predict a class $c\in\mathcal{C}$, within a set of labels $\mathcal{C}$, through a battery of feedforward deterministic transformations, which are implemented at the hidden layers $\{h^l: l\in[1..L]\}$ by mapping an input sample $z\in\mathcal{Z}$ to the network output $h^L$ as below [Bengio, 2009]:

$$\begin{aligned} h^l &= \phi^l(s^l),\quad \forall l\in[1..L]\\ s^l &= b^l + W^l h^{l-1}\\ h^0 &= z \end{aligned} \tag{6-1}$$

where $b^l\in\mathbb{R}^{P_{l+1}}$ is the l-th offset vector, $W^l\in\mathbb{R}^{P_{l+1}\times P_l}$ the l-th linear projection, and $P_l\in\mathbb{N}$ the size of the l-th layer. The function $\phi^l(\cdot)\in\mathbb{R}$ applies saturating, non-linear, element-wise operations. The first layer in Equation (6-1) (that is, $h^0\in\mathbb{R}^P$) is fixed to the P-dimensional input feature vectors so that $\mathcal{Z}\subset\mathbb{R}^P$. In turn, the output layer $h^L\in\mathbb{R}^C$, with $C=|\mathcal{C}|$, works as an estimator of the posterior class probability p(c|z) when the last saturating function $\phi^L(\cdot)$ is subject to the following constraints:

$$\phi^L(u_c)\in[0,1] \tag{6-2a}$$

$$\sum_{c\in\mathcal{C}}\phi^L(u_c)=1 \tag{6-2b}$$

so that the maximum a-posteriori classification criterion holds:

$$c^\star=\arg\max_{c\in\mathcal{C}}\ p(c|z)=\arg\max_{c\in\mathcal{C}}\ h^L_c(z). \tag{6-3}$$

To train an MLP-based classifier, a set of input samples $Z=\{z_n\in\mathbb{R}^P: n\in[1..N]\}$ along with their corresponding expected outputs $c=\{c_n\in\mathcal{C}: n\in[1..N]\}$ are provided, and a predefined cost function $\mathcal{L}(H^L(Z),c)\in\mathbb{R}$ is minimized with respect to the set of parameters $\theta=\{W^l,b^l: l\in[1..L]\}$.
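The forward mapping of Equation (6-1) together with the output constraints of Equations (6-2a) and (6-2b) can be sketched as below. The hyperbolic tangent in the hidden layer, the softmax at the output, the zero biases, and the random weights are assumptions made for this illustration; the 324-14-3 layout mirrors the sizes reported later in Section 6.2.2.

```python
import numpy as np

def softmax(u):
    """Output non-linearity with values in [0, 1] that sum to one."""
    e = np.exp(u - np.max(u))                # shift for numerical stability
    return e / e.sum()

def mlp_forward(z, weights, biases):
    """Feedforward pass: h^0 = z, s^l = b^l + W^l h^(l-1), h^l = phi^l(s^l)."""
    h = z
    n_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        s = b + W @ h
        # tanh is an assumed saturating function; softmax only at the output layer.
        h = softmax(s) if l == n_layers else np.tanh(s)
    return h

rng = np.random.default_rng(0)
sizes = [324, 14, 3]                          # input features, hidden neurons, classes
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
posterior = mlp_forward(rng.normal(size=324), weights, biases)
predicted_class = int(np.argmax(posterior))   # maximum a-posteriori rule, Equation (6-3)
```

The vector `posterior` plays the role of the network output, and its largest entry selects the predicted class.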

6.1.1 Matrix-based entropy as a cost function for MLP

Here, we extend the matrix-based entropy as a cost function for learning MLP parameters. At this point, we take the definition of the α-order entropy of a symmetric positive definite matrix, proposed by [Giraldo and Principe, 2013]:

$$S_\alpha(K)=\frac{1}{1-\alpha}\log\left(\mathrm{tr}\left(K^\alpha\right)\right) \tag{6-4}$$

Therefore, we look for the set of parameters minimizing the matrix conditional entropy (MCE) of the expected outputs c given the MLP evaluations $h^L_n(z_n)$ as:

$$\begin{aligned}\underset{\theta}{\text{minimize}}\quad & \mathcal{L}=S_\alpha(H^L|c)\\ \text{subject to}\quad & h^l=\phi^l(s^l)\\ & s^l=b^l+W^lh^{l-1},\end{aligned} \tag{6-5}$$

Now we translate this problem to kernel matrices as follows: let the matrix $K^L\in\mathbb{R}^{N\times N}$ encode the similarity between projected sample pairs, $h^L_n, h^L_m$, with elements $k^L_{nm}=\kappa(e^L_{nm})$ and $e^L_{nm}=h^L_n-h^L_m$, and let the output similarities be assessed by $k^C_{nm}=\kappa_C(c_n,c_m)$, enclosed in $K^C\in\mathbb{R}^{N\times N}$. Then, the conditional entropy of the data assembly can be computed as:

$$S_\alpha(H^L|c)=S_\alpha(NK^L\circ K^C)-S_\alpha(K^L), \tag{6-6}$$

where ∘ corresponds to the Hadamard product.
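A small numerical sketch of Equations (6-4) and (6-6) is given next. It assumes kernels with unit diagonal that are divided by N so that each matrix has unit trace (the normalization under which the factor N in Equation (6-6) keeps the joint matrix at unit trace), and it uses α=2 by default; these choices and the function names are assumptions of the example.

```python
import numpy as np

def matrix_entropy(A, alpha=2.0):
    """S_alpha(A) = log(tr(A^alpha)) / (1 - alpha) for a PSD matrix with unit trace."""
    # tr(A^alpha) equals the sum of the eigenvalues raised to alpha.
    eigvals = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log(np.sum(eigvals**alpha)) / (1.0 - alpha)

def conditional_entropy(K_out, K_lab, alpha=2.0):
    """Joint entropy of the normalized Hadamard product minus the entropy of the output kernel."""
    N = K_out.shape[0]
    A, B = K_out / N, K_lab / N              # unit-trace normalization (unit-diagonal kernels assumed)
    return matrix_entropy(N * A * B, alpha) - matrix_entropy(A, alpha)

labels = np.array([0, 0, 1, 1])
K_lab = (labels[:, None] == labels[None, :]).astype(float)
K_out = np.ones((4, 4))                       # degenerate case: all network outputs identical
value = conditional_entropy(K_out, K_lab)
```

In this degenerate case the outputs carry no information about two equally sized classes, and the conditional entropy equals log 2; the identity kernel divided by N gives the maximal entropy log N.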

Here, the optimization problem is solved using a gradient descent approach with backpropagable parameter derivatives. To this end, we first take advantage of the spectral properties of the kernel matrices to compute the derivative of the matrix entropy in Equation (6-4) with respect to K as [Lewis, 1996]:

$$\nabla S_\alpha(K)=\frac{\alpha}{(1-\alpha)\,\mathrm{tr}(K^\alpha)}U\Lambda^{\alpha-1}U^\top \tag{6-7}$$

being $K=U\Lambda U^\top$ the spectral decomposition of the kernel matrix, with eigenvectors encompassed in $U\in\mathbb{R}^{N\times N}$ and eigenvalues $\Lambda=\mathrm{diag}(\lambda_1,\cdots,\lambda_N)$. The gradient of the conditional entropy with respect to the output data kernel is given by:

$$\nabla_{K^L}S_\alpha(H^L|c)=\frac{\alpha N}{1-\alpha}\,\frac{(NK^L\circ K^C)^{\alpha-1}}{\mathrm{tr}\left((NK^L\circ K^C)^{\alpha}\right)}\circ K^C-\frac{\alpha}{1-\alpha}\,\frac{(K^L)^{\alpha-1}}{\mathrm{tr}\left((K^L)^{\alpha}\right)} \tag{6-8}$$

The derivatives with respect to the MLP parameters can be computed using the following chain rule:

$$\frac{dS_\alpha(H^L|c)}{d\theta}=\nabla_{K^L}S_\alpha(H^L|c)\left(\frac{dK^L(e)}{de}\right)\left(\frac{de}{d\theta}\right) \tag{6-9}$$

Algorithm 1 summarizes the backpropagation procedure for updating the MLP parameters using the matrix conditional entropy as the cost function.

6.1.2 Network pre-training using centered kernel alignment

Let $H^l=\{h^l_i(z_i)\in\mathbb{R}^{P_{l+1}}: i\in[1..N]\}$, with $H^l\in\mathbb{R}^{P_{l+1}\times N}$, be the hidden state matrix projecting Z to the l-th latent space. In order to encode the affinity between a pair of latent samples, $h^l_n, h^l_m$, we define the following kernel function:

$$k^l_{nm}=\kappa\left(d\left(h^l_n,h^l_m\right)\right), \tag{6-10}$$

being $d:\mathbb{R}^{P_{l+1}}\times\mathbb{R}^{P_{l+1}}\to\mathbb{R}^+$ a distance operator implementing the positive definite kernel function κ(·). Therefore, the application of κ over each sample pair in $H^l$ yields the kernel matrix $K^l\in\mathbb{R}^{N\times N}$, estimating the covariance of the induced random process $\mathcal{H}^l$ over the RKHS. Considering the linear component of the layer transitions, we apply the Mahalanobis distance, which is defined for P-dimensional spaces by the following inverse covariance matrix $W^lW^{l\top}$:

$$d\left(h^l_n,h^l_m\right)=\left(h^l_n-h^l_m\right)W^lW^{l\top}\left(h^l_n-h^l_m\right)^\top. \tag{6-11}$$
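The kernel built from the Mahalanobis distance of Equation (6-11) can be sketched as follows. The Gaussian form of κ, the bandwidth value, the row-wise sample layout, and the function name are assumptions of this example; the text only requires κ to be a positive definite kernel function.

```python
import numpy as np

def mahalanobis_kernel(Hs, W, sigma=1.0):
    """Kernel matrix over latent samples.

    Hs: (N, P) latent samples stored as rows; W: (P, Q) projection, so that W W^T is (P, P).
    """
    Z = Hs @ W                                # (h_n - h_m) W equals z_n - z_m
    diff = Z[:, None, :] - Z[None, :, :]
    d = np.sum(diff**2, axis=-1)              # (h_n - h_m) W W^T (h_n - h_m)^T
    return np.exp(-d / (2.0 * sigma**2))      # assumed Gaussian kappa

rng = np.random.default_rng(0)
Hs = rng.normal(size=(6, 4))
W = rng.normal(size=(4, 2))
K = mahalanobis_kernel(Hs, W)
```

Computing the distance through the projected samples avoids forming the P×P matrix explicitly, which matters when the input dimension is large.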

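Algorithm 1 (MCE-MLP): Backpropagation MLP training for the $S_\alpha$ cost function.

1: Compute from input to output the latent variables $s^l$, $h^l$.

2: Compute the matrix $P\in\mathbb{R}^{N\times N}$ at the output layer:

$$P=\left(\nabla_{K^L}S_\alpha(H^L|c)\right)\left(\frac{dK^L(e)}{de}\right)$$

3: Compute from output to input the auxiliary equations:

$$g^l_n=\begin{cases} h^L_n\circ\phi'(s^{L-1}_n) & : l=L\\ \left[(W^l)^\top(g^l_n)\right]\circ\phi'(s^{l-1}_n) & : l\in[2..L-1]\\ \left[(W^1)^\top(g^1_n)\right]\circ\phi'(z_n) & : l=1 \end{cases}$$

$$g^l_{nm}=\begin{cases} h^L_m\circ\phi'(s^{L-1}_n) & : l=L\\ \left[(W^l)^\top(g^l_{nm})\right]\circ\phi'(s^{l-1}_n) & : l\in[2..L-1]\\ \left[(W^1)^\top(g^1_{nm})\right]\circ\phi'(z_n) & : l=1 \end{cases}$$

4: Compute the partial derivatives:

$$\frac{\partial S_\alpha(H^L|c)}{\partial W^l}=4\sum_{n,m=1}^{N}p_{nm}(g^l_n)(h^l_n)^\top-4\sum_{n,m=1}^{N}p_{nm}(g^l_{nm})(h^l_n)^\top$$

$$\frac{\partial S_\alpha(H^L|c)}{\partial b^l}=4\sum_{n,m=1}^{N}p_{nm}(g^l_n)-4\sum_{n,m=1}^{N}p_{nm}(g^l_{nm})$$

5: Update the parameters at iteration t using the gradient descent rule:

$$W^l(t)=W^l(t-1)-\lambda_t\frac{\partial S_\alpha(H^L|c)}{\partial W^l}$$

$$b^l(t)=b^l(t-1)-\lambda_t\frac{\partial S_\alpha(H^L|c)}{\partial b^l}$$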


With the purpose of improving the system performance regarding the learning speed and

classification accuracy, we introduce the available supervised knowledge into the pre-training

stage as the target kernel matrix KC. Then, we learn each matrix W l by maximizing the

similarity between K l and KC through the real-valued centered kernel alignment (CKA)

[Brockmeier et al., 2014]:

$$\rho\left(K^l,K^C\right)=\frac{\left\langle HK^lH,\,HK^CH\right\rangle_F}{\left\|HK^lH\right\|_F\left\|HK^CH\right\|_F}, \tag{6-12}$$

where $H=I-N^{-1}\mathbf{1}\mathbf{1}^\top$ is a centering matrix ($H\in\mathbb{R}^{N\times N}$), $\mathbf{1}\in\mathbb{R}^N$ is an all-ones vector, and notations $\langle\cdot,\cdot\rangle_F$ and $\|\cdot\|_F$ stand for the Frobenius inner product and norm, respectively. Since the CKA cost function in Equation (6-12) provides an assembly of discriminative linear projections $W^l$ better matching the relations between hidden states $H^l$ and target information $K^C$, we devise the following optimization problem to compute the projection matrix:

$$\hat{W}^l=\arg\max_{W^l}\ \rho\left(K^l,K^C\right), \tag{6-13}$$

where the pre-trained $\hat{W}^l$ initializes the l-th MLP layer.

Additionally, the weighting matrix mapping from the input to the first hidden layer allows analyzing the contribution of the original features to the latent space by computing the relevance vector $\varrho\in\mathbb{R}^P$ as the average of the squared weights attached to each feature:

$$\varrho_p=E\left\{w_{qp}^2:\forall q\in[1,P_1]\right\}, \tag{6-14}$$

where the weight $w_{qp}\in\mathbb{R}$ associates the p-th feature with the q-th hidden neuron. Notation E{·} stands for the averaging operator. The main assumption behind the relevance introduced in Equation (6-14) is that the larger the value of $\varrho_p$, the larger the dependency of the estimated embedding on the p-th input attribute.
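The relevance of Equation (6-14) reduces to a column-wise average of squared first-layer weights. The sketch below assumes that the weight matrix stores one hidden neuron per row and one input feature per column; the function name and the toy weights are hypothetical.

```python
import numpy as np

def relevance(W1):
    """Average squared weight per input feature.

    W1: (P1, P) first-layer weights, row q = hidden neuron, column p = input feature.
    """
    return np.mean(W1**2, axis=0)

# Toy first layer with two hidden neurons and three input features.
W1 = np.array([[1.0, 0.0, 2.0],
               [3.0, 0.0, 0.0]])
rel = relevance(W1)
```

Here the second feature has zero relevance because no hidden neuron uses it, while the first feature dominates the embedding.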

6.2 Experimental Setup

An automated, computer-aided diagnosis system based on artificial neural networks is introduced to classify structural MRI scans from the ADNI dataset in accordance with the following three neurological classes: Normal Control (NC), Mild Cognitive Impairment (MCI), and Alzheimer’s Disease (AD). Figure 6-1 illustrates the methodological development of the proposed approach: firstly, MRIs are independently segmented and a set of features is extracted from the resulting parcellations. Then, centered kernel alignment is used to learn a projection matrix initializing the MLP, and the matrix conditional entropy is minimized in the MLP training. The tuned and trained model finally predicts the diagnosis for a given image.


Figure 6-1: General MRI classification pipeline. Stages: Input (MRI scans); MRI processing (segmentation, feature extraction); Cross-validation training (MLP classifier, CKA pre-training, MCE function); Output (CAD).

6.2.1 Processing of MRI data

We used FreeSurfer version 5.1 (a freely available¹, widely used, and extensively validated brain MRI analysis software package) to process the structural brain MRI scans and compute the morphological measurements [Fischl, 2012]. FreeSurfer morphometric procedures are considered since they have shown suitable test-retest reliability across scanner manufacturers and across field strengths, thus becoming a standard for MRI feature extraction [Han et al., 2006]. The FreeSurfer pipeline is fully automatic and includes the following procedures: a watershed-based skull stripping [Segonne et al., 2004], a transformation to the Talairach space, an intensity normalization and bias field correction [Sled et al., 1998], tessellation of the gray/white matter boundary, topology correction [Segonne et al., 2007], and a surface deformation [Fischl, 2004]. Consequently, a representation of the cortical surface between white and gray matter, a representation of the pial surface, and a segmentation of white matter from the rest of the brain are obtained. FreeSurfer computes structure-specific volume, area, and thickness measurements. Cortical and subcortical volumes are normalized to each subject’s estimated Total Intracranial Volume (eTIV) [Buckner et al., 2004]. Table 6-1 summarizes the five feature sets extracted for each subject, which are concatenated into the feature matrix X with dimensions N=1993 and D=324.

Type                       Number of features    Units
Cortical Volumes (CV)      70                    mm³
Subcortical Volumes (SV)   42                    mm³
Surface Area (SA)          72                    mm²
Thickness Average (TA)     70                    mm
Thickness Std. (TS)        70                    mm
Total                      324

Table 6-1: FreeSurfer extracted features.

¹ freesurfer.nmr.mgh.harvard.edu


6.2.2 Tuning of ANN model parameter

Given the D=324 input MRI features and the three neurological classes, we use feedforward ANNs with one hidden layer, 324 input neurons, and 3 output neurons. An exhaustive search is carried out for tuning the single free parameter, namely, the number of neurons in the hidden layer (m1). For the sake of comparison, we also consider AutoENcoders (AEN) [Vincent et al., 2010] and the well-known Principal Component Analysis (PCA) for the initialization stage. All of these approaches (AEN, PCA, and CKA) provide a projection matrix with an output dimension that, in this case, equals the hidden layer size. Thus, the resulting projections are used as the initial weights for the first layer, while biases and output layer weights are randomly initialized. We then train the MLP based on the MCE cost function for different numbers of neurons using a 5-fold cross-validation scheme, as shown in Figure 6-2. Since we look for the network configuration with the highest accuracy and the lowest deviation, the search indicates that the best number of hidden neurons is m1=20, m1=16, and m1=14 for the AEN, PCA, and CKA approaches, respectively.
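As an example of how a projection can seed the first layer, the sketch below builds the PCA initialization from the leading principal directions of the feature matrix. The random stand-in data, the sample count of 200, and the function name are assumptions of this illustration; only the 324 features and the m1=16 hidden neurons come from the text.

```python
import numpy as np

def pca_projection(X, m1):
    """Return the (m1, D) matrix holding the m1 leading principal directions of X (N, D)."""
    Xc = X - X.mean(axis=0)                   # center each feature
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:m1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 324))               # stand-in for the MRI feature matrix
W1 = pca_projection(X, m1=16)                 # initial weights of the first layer
```

The rows of `W1` are orthonormal, so the hidden pre-activations start as uncorrelated coordinates of the centered input; the biases and output weights would still be drawn at random, as described above.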

Figure 6-2: Artificial neural network performance (accuracy) along the number of nodes in the hidden layer (m1, ranging from 6 to 30) for the three initialization approaches: (a) AutoENcoder (AEN), (b) PCA-based projection, and (c) CKA-based projection. Results are computed under a 5-fold cross-validation scheme.

We further analyze the influence of each feature on the initialization process regarding the relevance criterion introduced in Equation (6-14). The relevance results in Figure 6-3 show that the CKA approach enhances the Subcortical Volume features while it diminishes the influence of most Cortical Volumes and Thickness Averages. In contrast, the relevance of each feature set provided by AEN and PCA is practically the same. Hence, CKA allows the selection of relevant biomarkers from MRI.


Figure 6-3: Relevance indexes (ρ, from 0 to 1) grouped by feature type: Cortical Volume (CV), Subcortical Volume (SV), Surface Area (SA), Thickness Average (TA), and Thickness Std. (TS). Panels: (a) AEN; (b) PCA; (c) CKA.

6.2.3 Classifier performance of neurological classes

Algorithm     Features                                        Classifier
Abdulkadir    Voxel-based morphometry                         Support Vector Machine
Ledig         Volume and intensity relations                  Random forest classifier
Sørensen      Volume, thickness, shape, intensity relations   Regularized linear discriminant analysis
Wachinger     Volume, thickness, shape                        Generalized linear model

Table 6-2: Best performing algorithms in the 2014 CADDementia challenge.

The ANN models tuned for the three initialization strategies are contrasted with the four best-performing approaches of the 2014 CADDementia challenge [Bron et al., 2015], which are listed in Table 6-2. The compared algorithms are evaluated in terms of their classification performance: accuracy (Acc), class-wise true positive rate ($\tau^c$), and area under the receiver-operating-characteristic curve (Auc) criteria, respectively defined as:

$$\mathrm{Acc}=\frac{\sum_c\left(t^c_p+t^c_n\right)}{\sum_c N^c} \tag{6-15a}$$

$$\tau^c=\frac{t^c_p}{N^c} \tag{6-15b}$$

$$\mathrm{Auc}=\frac{\sum_c A^c_{uc}\cdot N^c}{\sum_c N^c}, \tag{6-15c}$$

where $c\in\{NC, MCI, AD\}$ indexes each class, and $N^c$, $t^c_p$, and $t^c_n$ are the number of samples, true positives, and true negatives for the c-th class, respectively. The area under the curve Auc is the weighted average of the area under the ROC curve of each class, $A^c_{uc}$. Although the test samples of the challenge and those tested in the present work are not the same, we assume that both testing data are equivalent for evaluation purposes.
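The evaluation criteria can be sketched from a confusion matrix as shown below. Two points are assumptions of this example: accuracy is computed as the fraction of correctly classified samples, which is one reading of Equation (6-15a), and the confusion matrix is a toy one. The class-wise AUC values and class sizes used for the weighted average are those reported for the CKA initialization in Table 6-3 and Figure 6-4 (196 NC, 247 MCI, and 154 AD test samples).

```python
import numpy as np

def class_metrics(confusion):
    """confusion[i, j]: number of samples of target class i assigned to class j."""
    confusion = np.asarray(confusion, dtype=float)
    n_per_class = confusion.sum(axis=1)                 # N^c
    tpr = np.diag(confusion) / n_per_class              # tau^c, Equation (6-15b)
    accuracy = np.trace(confusion) / confusion.sum()    # fraction of correct predictions (assumed reading)
    return accuracy, tpr

def weighted_auc(per_class_auc, n_per_class):
    """Class-size weighted average of the per-class areas, Equation (6-15c)."""
    return float(np.average(per_class_auc, weights=n_per_class))

# Toy confusion matrix with three classes of ten samples each.
accuracy, tpr = class_metrics([[8, 2, 0],
                               [1, 6, 3],
                               [0, 2, 8]])

# Per-class AUC (NC, MCI, AD) and class sizes reported for the CKA initialization.
auc_cka = weighted_auc([91.7, 78.4, 88.3], [196, 247, 154])
```

With the reported per-class values, the weighted average reproduces the overall Auc of about 85.3 listed for NN-CKA.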

As seen in Table 6-3, which compares the classification performance on the 30% “best” quality test set, the proposed approach outperforms the other compared initialization approaches. Moreover, it performs better than the other computer-aided diagnosis methods as a whole. For the “partial” quality images, as expected, the general performance diminishes in all MLP approaches. Nonetheless, the overall accuracy and area under the curve are still competitive with respect to the challenge winner. Based on the displayed ROC curves and confusion matrices for the MLP-based classifiers with the optimum parameter set (see Figure 6-4), we also infer that the proposed approach improves MCI discrimination.

Algorithm     Acc    τNC    τMCI   τAD    Auc    A_uc^NC   A_uc^MCI   A_uc^AD

2014 CADDementia
Sørensen      63.0   96.9   28.7   61.2   78.8   86.3      63.1       87.5
Wachinger     59.0   72.1   51.6   51.5   77.0   83.3      59.4       88.2
Ledig         57.9   89.1   41.0   38.8   76.7   86.6      59.7       84.9
Abdulkadir    53.7   45.7   65.6   49.5   77.7   85.6      59.9       86.7

“best” quality testing
NN-AEN        47.6   73.4   33.1   38.1   64.9   71.4      53.4       75.1
NN-PCA        63.8   70.4   56.7   66.9   80.0   87.2      70.0       87.0
NN-CKA        70.9   78.4   66.6   68.3   85.3   91.7      78.4       88.3

“partial” quality
NN-AEN        62.9   64.6   46.4   32.0   77.0   82.5      65.6       72.5
NN-PCA        64.4   67.6   49.3   26.0   78.4   82.3      67.5       79.2
NN-CKA        65.2   68.6   38.6   42.0   81.6   85.7      70.1       82.4

Table 6-3: Classification performance on the testing groups for considered algorithms under evaluation criteria. Top: Baseline approaches. Bottom: ANN pre-trainings.

Figure 6-4: Receiver-operating-characteristic curve (top) and confusion matrix (bottom) on the 30% test data for AEN (left), PCA (center), and CKA (right) initialization approaches at the best parameter set of the ANN classifier. ROC axes: False Positive Rate and True Positive Rate. Confusion matrix axes: Target Class and Output Class.

(a) AEN (Auc: 64.9): NC (AUC: 71.4), MCI (AUC: 53.4), AD (AUC: 75.1)
(b) PCA (Auc: 80.0): NC (AUC: 87.2), MCI (AUC: 70.0), AD (AUC: 87.0)
(c) CKA (Auc: 85.3): NC (AUC: 91.7), MCI (AUC: 78.4), AD (AUC: 88.3)

Confusion matrices: each row lists, for one target class, the number of test samples (and the share of all test samples) assigned to NC / MCI / AD.

(d) AEN (Acc: 47.6)
NC:   152 (25.5%) / 28 (4.7%) / 16 (2.7%)
MCI:  126 (21.1%) / 73 (12.2%) / 48 (8.0%)
AD:   26 (4.4%) / 70 (11.7%) / 58 (9.7%)

(e) PCA (Acc: 63.8)
NC:   127 (21.3%) / 57 (9.5%) / 12 (2.0%)
MCI:  57 (9.5%) / 134 (22.4%) / 56 (9.4%)
AD:   5 (0.8%) / 46 (7.7%) / 103 (17.3%)

(f) CKA (Acc: 70.9)
NC:   159 (26.6%) / 34 (5.7%) / 3 (0.5%)
MCI:  51 (8.5%) / 158 (26.5%) / 38 (6.4%)
AD:   8 (1.3%) / 43 (7.2%) / 103 (17.3%)

6.3 Summary

From the validation carried out above for MRI-based dementia diagnosis, the following aspects emerge as relevant for the developed proposal of ANN pre-training:

– As commonly implemented by the state-of-the-art ANN algorithms, the proposed pre-training approach also has a free model parameter, namely the number of hidden neurons. Tuning of this parameter is carried out heuristically by an exhaustive search so as to reach the highest accuracy on a 5-fold cross-validation (see Figure 6-2). Thus, 14, 20, and 16 hidden neurons are selected for CKA, AEN and PCA, respectively. As a result, the suggested CKA approach improves on the other ANN pre-training approaches (by about 10%) with the additional benefit of decreasing the parameter sensitivity.

– We assess the influence of each MRI feature on the pre-training procedure regarding the relevance criterion introduced in Equation (6-14). As follows from Figure 6-3, AEN and PCA weight every feature evenly, restricting their ability to extract biomarkers. By contrast, CKA enhances the influence of Subcortical Volumes and Thickness Standard deviations while it diminishes the contribution of Cortical Volumes and Thickness Averages. Consequently, the proposed approach is also suitable for feature selection tasks.

– We compare the developed MLP approach with the best four classification strategies of the 2014 CADDementia challenge, which is devoted specifically to dementia classification. From the obtained results, summarized in Table 6-3, it follows that our proposal outperforms the other algorithms in most of the evaluation criteria and imaging conditions, providing the most balanced performance over all classes. In particular, we increase the classification accuracy and the average area under the ROC curve by 7 percentage points. It is worth noting that, although Sørensen’s approach achieves a $\tau_{NC}$ value that is 18.5 percentage points higher than that of the proposal, its performance turns out to be biased towards the NC class, yielding a worse value for MCI. That is, CKA-MLP attains a less biased per-class performance in dementia classification. In the case of “partial” quality images, despite the general performance reduction, the proposed pre-training remains the best approach. Moreover, the overall measures are still competitive with the results reported by the CADDementia challenge.

– Figure 6-4 shows the per-class ROC curves and confusion matrices obtained by the contrasted approaches. In all cases, the area under the curve and the accuracy for the NC and AD classes are higher than those achieved for the MCI class (Figures 6-4(a) to 6-4(c)). Hence, MCI classification from the incorporated MRI features remains a challenging task for the following reasons: MCI is widely known to be heterogeneous, it is an intermediate class between healthy subjects and diagnosed Alzheimer’s disease, and MCI subjects may eventually convert to AD or revert to NC. Moreover, the confusion matrices displayed in Figures 6-4(d) to 6-4(f) confirm that NC and AD are properly distinguished in most cases. Nevertheless, the MCI class introduces the most errors when considered as either target or output class. Therefore, dedicated studies on mild cognitive impairment should improve the diagnosis [Ramírez et al., 2016, Wolz et al., 2011].
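The per-class areas under the ROC curve reported in Figure 6-4 can be read in a one-vs-rest fashion. As a minimal sketch of how such a value may be computed, assuming that each subject receives a real-valued score for the class of interest, the snippet below uses the rank-based (Mann-Whitney) formulation of the AUC. The function name `one_vs_rest_auc` and the toy scores are illustrative assumptions and do not come from the experiments above.

```python
def one_vs_rest_auc(scores, labels, positive):
    """AUC of the class `positive` against all remaining classes.

    Rank (Mann-Whitney) formulation: the AUC equals the probability that
    a randomly drawn positive sample is scored higher than a randomly
    drawn negative one, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical MCI scores for six subjects (not thesis data).
labels = ["NC", "NC", "MCI", "MCI", "AD", "AD"]
mci_scores = [0.2, 0.4, 0.6, 0.3, 0.5, 0.1]
auc_mci = one_vs_rest_auc(mci_scores, labels, "MCI")
```

For these toy scores, six of the eight positive-negative pairs are ranked correctly, so the MCI-vs-rest AUC is 0.75. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but would usually be replaced by a sort-based computation on larger test sets.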

7 Conclusions and Future Work

7.1 Concluding remarks

– A new kernel function estimation based on an information potential variability framework is presented. KEIPV estimates an RKHS that spans the widest range of information force magnitudes among data points. In particular, KEIPV relates different kernel functions to the intrinsic information potential variations in Parzen-based pdf estimations [Principe, 2010]. Thereby, we seek an RKHS that maximizes the overall information potential variability with respect to the global kernel parameter. An updating rule for estimating the Gaussian kernel bandwidth is also introduced as a function of the forces induced by the distances among samples. The KEIPV criterion is used for tuning Gaussian kernel parameters throughout this manuscript.

– A kernel-based image representation is introduced to support MRI discrimination in the segmentation of brain structures. Our proposal encodes inter-slice variations related to the brain structure distribution. Thus, head patterns can be extracted along each 3D axis view, namely, Axial, Coronal, and Sagittal. According to the results attained on well-known MRI datasets, the proposed kernel-based representation reveals the inherent image distributions, namely, age and gender categories. The representation is evaluated as a template selection strategy for image segmentation, termed KAISER. Results show that selecting templates reduces the bias towards population averages by providing a small subset that performs better than the whole atlas set. In addition, KAISER selects the smallest atlas subset with the best performance in comparison with other conventional selection strategies, such as SSD and MI.

– We discuss the use of information-based measures to optimize model parameters for MRI segmentation. In particular, we introduce the α-order Rényi entropy as a new cost function for finding the tissue distribution parameters under the assumption of normally distributed classes. Additionally, we develop the updating equations for an EM-based optimization using the considered function. As a result, parameters are updated from weighted voxel-wise averages, where the influence of the r-th voxel on each parameter is $(b_{rc} f_{rc}/q_{rc})^{\alpha}$. We also prove the relation between two different entropy orders: for $\alpha \in [0, 1)$, the larger the order, the smaller the information measure. In fact, the maximum possible value of the Rényi entropy is achieved when $\alpha = 0$, corresponding to $H_0 = -\log(1/|\Omega|)$. Additionally, the entropy order and the algorithm convergence are proportionally related. This is due to the influence of α on the probability values: as α tends to zero, the entropy tends to weight all events more evenly, regardless of their probability, i.e., $(b_{rl} f_{rl}/q_{rl})^{\alpha} \to 1$, $\forall r \in \Omega$, $l \in L$. On the other hand, for large order values, the entropy is determined by the most probable events. Regarding the segmentation accuracy, we show that the larger the noise intensity, the larger the number of misclassifications. Additionally, we compare our proposal against the log-likelihood as the baseline approach. The results achieved for the optimal entropy order show that our scheme outperforms the baseline, since the parameters obtained with the entropy function are more discriminative than those obtained with the log-likelihood.

– Chapter 5 proposes a new multi-atlas weighted label fusion approach for brain image segmentation that relies on a more elaborate fusion procedure, incorporating knowledge about the neighborhood as well as the patch structure of the considered tissues. For this purpose, all image patches are projected into a discriminating space that maximizes the similarity between the labels and the feature vectors by using the introduced centered kernel alignment criterion. Besides, the adopted neighborhood-wise analysis accounts for more useful information about the local tissue structure, which limits the influence of small registration errors on the query image. Nonetheless, we identify a couple of restrictions on the use of centered kernel alignment (CKA). Firstly, the number of samples should be larger than the input and output dimensions to avoid overfitted projections. We cope with this drawback by considering sufficiently large sampling subsets for training. Otherwise, validation techniques should be included in the CKA learning. Secondly, the attained projections must always be lower dimensional than the original feature space. In this case, the enhancement in tissue discrimination is due to the affinity between labels and features, not to an increase in dimension.

– Finally, a new multi-layer perceptron training is introduced aiming to improve the computer-aided diagnosis of dementia. Given a set of features extracted from a subject’s brain MRI, the dementia diagnosis task consists of assigning subjects to Normal Control (NC), Mild Cognitive Impairment (MCI), or Alzheimer’s Disease (AD). To improve the classification performance, we incorporate a matrix that projects the samples into a more discriminating feature space so that the affinity between projected features and class labels is maximized. Such an affinity criterion is implemented by the Centered Kernel Alignment (CKA), which provides two key benefits: i) the only free parameter is the hidden dimension; ii) a relevance analysis can be introduced to find biomarkers. The MLP is then trained using a gradient descent algorithm that minimizes the matrix conditional entropy as a new cost function. As a result, our proposal outperforms the contrasted algorithms (by 7 percentage points in classification accuracy and area under the ROC curve) and reduces the class bias, resulting in a better MCI discrimination.
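The relation between entropy orders stated above for the Rényi entropy can be checked numerically. The following sketch assumes the standard definition $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ over a discrete distribution. The distribution `p` and the list of orders are hypothetical values chosen only for illustration, and the function is not the segmentation cost function developed in this work.

```python
import math


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha >= 0, alpha != 1), in nats."""
    if alpha < 0 or alpha == 1:
        raise ValueError("alpha must be non-negative and different from 1")
    # Zero-probability events are skipped, so alpha = 0 counts the support size.
    total = sum(pi ** alpha for pi in p if pi > 0)
    return math.log(total) / (1.0 - alpha)


# Hypothetical distribution over |Omega| = 4 events (assumption for the example).
p = [0.7, 0.15, 0.1, 0.05]
orders = [0.0, 0.25, 0.5, 0.75]
entropies = [renyi_entropy(p, a) for a in orders]
```

Under these assumptions the value at order zero equals $\log|\Omega| = \log 4$, and the entropies decrease as the order grows, which agrees with the statement that the maximum is reached at $\alpha = 0$.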

7.2 Future Work

– Regarding the template selection, the following research lines are proposed. Since the obtained decomposition eigenvectors showed non-linear relations, other non-linear embedding techniques, e.g., locally linear embedding, can be used for highlighting the essential structure. Slice-wise metrics, such as the mutual information, can be tested as the core of ISK once they are demonstrated to satisfy the kernel properties. Also, partitions other than slices, e.g., 3D blocks, can be explored to account for different spatial dependencies. Finally, tensor-product kernel parameters can be tuned under supervised schemes aiming to highlight other demographic categories.

– A result of using Rényi’s entropy for probabilistic image segmentation is the voxel-wise contribution of each voxel to the cost function minimization. Such a contribution may be extended to model the image intensity distribution with other methods, such as Parzen windows. In addition, this kind of information metric can be further adapted to other image processing tasks (e.g., registration) or to unified schemes involving registration, template selection, and dictionary learning [Roy et al., 2014].

– The centered kernel alignment for label fusion (CKA-LF) can be further combined with multimodal segmentation approaches to make the most of the different imaging techniques and to discard the non-useful ones. Moreover, CKA-LF has to be evaluated in template-based segmentation of other structures, such as bone, liver, lung, and heart. On the other hand, convolution masks can be used for extracting features from image modalities such as functional MRI. In this regard, CKA provides a new strategy for learning such masks aiming to enhance classification tasks.

– In Chapter 6, we proposed a new training scheme for multi-layer perceptrons using kernel-based cost functions. We plan to evaluate such a scheme in other brain MRI tasks, such as predicting conversion from MCI to Alzheimer’s disease, classifying attention deficit hyperactivity disorder, and building temporal atlases. Finally, CKA and matrix conditional entropy functions can be embedded into other machine learning tools, such as deep learning, to enhance their performance.
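For reference, the centered kernel alignment mentioned in the items above can be sketched as the normalized Frobenius inner product between two double-centered kernel matrices. The snippet below is a minimal pure-Python illustration, assuming a Gaussian kernel on the features and a delta kernel on the labels. The helper names, the toy samples, and the bandwidth value are assumptions made for the example and are not the thesis implementation.

```python
import math


def center(K):
    """Double-center a square kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def frobenius_inner(A, B):
    """Frobenius inner product between two matrices of equal size."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka(K, L):
    """Centered kernel alignment between kernel matrices K and L."""
    Kc, Lc = center(K), center(L)
    denom = math.sqrt(frobenius_inner(Kc, Kc) * frobenius_inner(Lc, Lc))
    if denom == 0.0:
        raise ValueError("centered kernels must be non-zero")
    return frobenius_inner(Kc, Lc) / denom


def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix with bandwidth sigma (assumed feature kernel)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))
             for y in X] for x in X]


def label_kernel(y):
    """Delta kernel on labels: 1 if two samples share a label, else 0."""
    return [[1.0 if a == b else 0.0 for b in y] for a in y]


# Toy samples forming two well-separated groups (hypothetical data).
X = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]]
K = gaussian_kernel(X, 1.0)
aligned = cka(K, label_kernel([0, 0, 1, 1]))   # labels follow the groups
shuffled = cka(K, label_kernel([0, 1, 0, 1]))  # labels ignore the groups
```

In this toy setting the alignment is close to one when the labels follow the two groups and close to zero when they do not, which is the behaviour exploited when a projection is learned to maximize the affinity between features and labels.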


7.3 Academic discussion

1 D. Cardenas-Pena, D. Collazos-Huertas, and G. Castellanos-Dominguez, “Centered Kernel Alignment Enhancing Neural Network Pretraining for MRI-Based Dementia Diagnosis,” Computational and Mathematical Methods in Medicine, vol. 2016, Article ID 9523849, 10 pages, 2016. http://dx.doi.org/10.1155/2016/9523849

2 M. Orbes-Arteaga, D. Cardenas-Pena, and G. Castellanos-Dominguez, “Head and Neck Auto Segmentation Challenge based on Non-Local Generative Models,” in MIDAS Journal, 2016. http://hdl.handle.net/10380/3539

3 E. E. Bron, M. Smits, W. M. Van Der Flier, H. Vrenken, F. Barkhof, P. Scheltens, J. M. Papma, R. M. Steketee, C. M. Orellana, R. Meijboom, et al., “Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: The CADDementia challenge,” NeuroImage, vol. 111, pp. 562–579, 2015. http://dx.doi.org/10.1016/j.neuroimage.2015.01.048

4 D. Cardenas-Pena, M. Orbes-Arteaga, and G. Castellanos-Dominguez, “Supervised brain tissue segmentation using a spatially enhanced similarity metric,” in Artificial Computation in Biology and Medicine, pp. 398–407, Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-18914-7_42

5 M. Orbes-Arteaga, D. Cardenas-Pena, M. A. Alvarez, A. Orozco, and G. Castellanos-Dominguez, “Spatial-dependent similarity metric supporting multi-atlas MRI segmentation,” in Pattern Recognition and Image Analysis, pp. 300–308, Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-19390-8_34

6 D. Cardenas-Pena, A. A. Orozco, and G. Castellanos-Dominguez, “Information-based cost function for a Bayesian MRI segmentation framework,” in Image Analysis and Processing-ICIAP 2015, pp. 548–556, Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-23231-7_49

7 M. Orbes-Arteaga, D. Cardenas-Pena, M. A. Alvarez, A. A. Orozco, and G. Castellanos-Dominguez, “Kernel centered alignment supervised metric for multi-atlas segmentation,” in Image Analysis and Processing-ICIAP 2015, pp. 658–667, Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-23231-7_59 Best young paper award finalist.

8 V. Machairas, M. Faessel, D. Cardenas-Pena, T. Chabardes, T. Walter, and E. Decenciere, “Waterpixels,” Image Processing, IEEE Transactions on, vol. 24, no. 11, pp. 3707–3716, 2015. http://dx.doi.org/10.1109/TIP.2015.2451011

9 M. Orbes-Arteaga, D. Cardenas-Pena, M. A. Alvarez, A. A. Orozco, and G. Castellanos-Dominguez, “Magnetic resonance image selection for multi-atlas segmentation using mixture models,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 391–399, Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-25751-8_47

10 D. Cardenas-Pena, M. Orbes-Arteaga, A. Castro-Ospina, A. Alvarez-Meza, and G. Castellanos-Dominguez, “A Kernel-based Representation to Support 3D MRI Unsupervised Clustering,” 22nd International Conference on Pattern Recognition in Stockholm, August 2014. http://dx.doi.org/10.1109/ICPR.2014.552

11 D. Cardenas-Pena, M. Orbes-Arteaga, and G. Castellanos-Dominguez, “Kernel-based Atlas Image Selection for Brain Tissue Segmentation,” 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society in Chicago, August 2014. http://dx.doi.org/10.1109/EMBC.2014.6944228

12 D. Cardenas-Pena, A. Alvarez-Meza, and G. Castellanos-Dominguez, “Kernel-Based Image Representation for Brain MRI Discrimination,” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 343-350, Springer International Publishing, 2014. APR-CIARP 2014 Best Paper Award.

13 A. Alvarez-Meza, D. Cardenas-Pena, A. E. Castro-Ospina, and G. Castellanos-Dominguez, “Tensor-Product Kernel-based Representation encoding Joint MRI View Similarity,” 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society in Chicago, August 2014.

14 E. Cuartas-Morales, D. Cardenas-Pena, and G. Castellanos-Dominguez, “Influence of anisotropic white matter modeling on EEG source localization,” 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society in Chicago, August 2014.

15 D. Collazos-Huertas, A. Giraldo-Forero, D. Cardenas-Pena, A. Alvarez Meza, and G. Castellanos-Domínguez, “Functional protein prediction using HMM based feature representation and relevance analysis,” Advances in Intelligent Systems and Computing, vol. 232, pp. 71-76, 2014. http://dx.doi.org/10.1007/978-3-319-01568-2_10

16 G. Strobbe, D. Cardenas-Pena, V. E. Montes Restrepo, P. van Mierlo, and S. Vandenberghe, “Selecting volume conductor models for EEG source localization of epileptic spikes: preliminary results based on 4 operated epileptic patients,” International Conference on Basic and Clinical Multimodal Imaging, Abstracts, Geneva-Switzerland, 2013. https://biblio.ugent.be/publication/4387776

17 D. Cardenas-Pena, J. Martínez-Vargas, and G. Castellanos-Domínguez, “Local binary fitting energy solution by graph cuts for MRI segmentation,” pp. 5131-5134, 2013.

18 D. Cardenas-Pena, M. Orozco-Alzate, and G. Castellanos-Dominguez, “Selection of time-variant features for earthquake classification at the Nevado-del-Ruiz volcano,” Computers. http://dx.doi.org/10.1016/j.cageo.2012.08.012

19 J. Martinez-Vargas, D. Cardenas-Pena, and G. Castellanos-Dominguez, “Extraction of stationary components in biosignal discrimination,” pp. 1-4, 2012.

20 D. Cardenas-Pena, J. Martínez-Vargas, and G. Castellanos-Domínguez, “Extraction of stationary spectral components using stochastic variability,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7441 LNCS, pp. 765-772, 2012. http://dx.doi.org/10.1007/978-3-642-33275-3_94

Bibliography

[Ahmed et al., 2011] Ahmed, S., Iftekharuddin, K. M., and Vossough, A. (2011). Efficacy

of texture, shape, and intensity feature fusion for posterior-fossa tumor segmentation in

MRI. IEEE transactions on information technology in biomedicine : a publication of the

IEEE Engineering in Medicine and Biology Society, 15(2):206–13.

[Aljabar et al., 2009] Aljabar, P., Heckemann, R. A., Hammers, A., Hajnal, J. V., and

Rueckert, D. (2009). Multi-atlas based segmentation of brain images: Atlas selection

and its effect on accuracy. NeuroImage, 46(3):726 – 738.

[Alvarez-Meza et al., 2014] Alvarez-Meza, A. M., Cardenas-Pena, D., and Castellanos-

Dominguez, G. (2014). Unsupervised Kernel Function Building Using Maximization of

Information Potential Variability. In Bayro-Corrochano, E. and Hancock, E., editors,

Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications SE

- 41, volume 8827 of Lecture Notes in Computer Science, pages 335–342. Springer Inter-

national Publishing.

[Alzubi et al., 2011] Alzubi, S., Islam, N., and Abbod, M. (2011). Multiresolution analysis

using wavelet, ridgelet, and curvelet transforms for medical image segmentation. Interna-

tional journal of biomedical imaging, 2011:136034.

[Amato et al., 2013] Amato, F., Lopez, A., Pena-Mendez, E. M., Vaňhara, P., Hampl, A.,

and Havel, J. (2013). Artificial neural networks in medical diagnosis. Journal of Applied

Biomedicine, 11(2):47–58.

[Ashburner, 2007] Ashburner, J. (2007). A fast diffeomorphic image registration algorithm.

NeuroImage, 38(1):95–113.

[Ashburner and Friston, 2000] Ashburner, J. and Friston, K. J. (2000). Voxel-based

morphometry–the methods. NeuroImage, 11(6 Pt 1):805–21.

[Ashburner and Friston, 2005] Ashburner, J. and Friston, K. J. (2005). Unified segmenta-

tion. NeuroImage, 26(3):839–51.

[Aubert-Broche et al., 2006] Aubert-Broche, B., Griffin, M., Pike, G. B., Evans, A. C., and

Collins, D. L. (2006). Twenty new digital brain phantoms for creation of validation image

data bases. IEEE transactions on medical imaging, 25(11):1410–6.

[Avants and Gee, 2004] Avants, B. and Gee, J. C. (2004). Geodesic estimation for large

deformation anatomical shape averaging and interpolation. NeuroImage, 23 Suppl 1:S139–

50.

[Bai et al., 2015] Bai, W., Shi, W., Ledig, C., and Rueckert, D. (2015). Multi-atlas segmentation with augmented features for cardiac MR images. Medical Image Analysis, 19(1):98–109.

[Balafar et al., 2010] Balafar, M. A., Ramli, A. R., Saripan, M. I., and Mashohor, S.

(2010). Review of brain MRI image segmentation methods. Artificial Intelligence Re-

view, 33(3):261–274.

[Bengio, 2009] Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and

Trends® in Machine Learning, 2(1):1–127.

[Bengio, 2012] Bengio, Y. (2012). Practical recommendations for gradient-based training of

deep architectures. In Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes

in Computer Science, pages 437–478. Springer Berlin Heidelberg.

[Bengio et al., 2007] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007).

Greedy layer-wise training of deep networks. In Scholkopf, B., Platt, J., and Hoffman, T.,

editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 153–160.

MIT Press.

[Bohland et al., 2009] Bohland, J. W., Bokil, H., Allen, C. B., and Mitra, P. P. (2009). The

brain atlas concordance problem: quantitative comparison of anatomical parcellations.

PloS one, 4(9):e7200.

[Brinkmann et al., 1998] Brinkmann, B. H., Manduca, A., and Robb, R. A. (1998). Opti-

mized homomorphic unsharp masking for MR grayscale inhomogeneity correction. Medical

Imaging, IEEE Transactions on, 17(2):161–171.

[Brockmeier et al., 2014] Brockmeier, A., Choi, J., Kriminger, E., Francis, J., and Principe,

J. (2014). Neural decoding with kernel-based metric learning. Neural Computation, 26:–.

[Bron et al., 2015] Bron, E. E., Smits, M., van der Flier, W. M., Vrenken, H., Barkhof,

F., Scheltens, P., Papma, J. M., Steketee, R. M., Orellana, C. M., Meijboom, R., Pinto,

M., Meireles, J. R., Garrett, C., Bastos-Leite, A. J., Abdulkadir, A., Ronneberger, O.,

Amoroso, N., Bellotti, R., Cardenas-Pena, D., Alvarez-Meza, A. M., Dolph, C. V.,

Iftekharuddin, K. M., Eskildsen, S. F., Coupe, P., Fonov, V. S., Franke, K., Gaser, C.,

Ledig, C., Guerrero, R., Tong, T., Gray, K. R., Moradi, E., Tohka, J., Routier, A., Dur-

rleman, S., Sarica, A., Fatta, G. D., Sensi, F., Chincarini, A., Smith, G. M., Stoyanov,

Z. V., Sørensen, L., Nielsen, M., Tangaro, S., Inglese, P., Wachinger, C., Reuter, M., van

Swieten, J. C., Niessen, W. J., and Klein, S. (2015). Standardized evaluation of algorithms

for computer-aided diagnosis of dementia based on structural MRI: The CADDementia

challenge. NeuroImage, 111:562–579.

[Brox and Cremers, 2008] Brox, T. and Cremers, D. (2008). On Local Region Models and a

Statistical Interpretation of the Piecewise Smooth Mumford-Shah Functional. International Journal of Computer Vision, 84(2):184–193.

[Buckner et al., 2004] Buckner, R. L., Head, D., Parker, J., Fotenos, A. F., Marcus, D., Mor-

ris, J. C., and Snyder, A. Z. (2004). A unified approach for morphometric and functional

data analysis in young, old, and demented adults using automated atlas-based head size

normalization: reliability and validation against manual measurement of total intracranial

volume. NeuroImage, 23(2):724–738.

[Cerrolaza et al., 2012] Cerrolaza, J. J., Villanueva, A., and Cabeza, R. (2012). Hierarchical

statistical shape models of multiobject anatomical structures: application to brain MRI.

IEEE transactions on medical imaging, 31(3):713–24.

[Cho et al., 1997] Cho, S., Jones, D., Reddick, W. E., Ogg, R. J., and Steen, R. G. (1997).

Establishing norms for age-related changes in proton T1 of human brain tissue in vivo. Magnetic resonance imaging, 15(10):1133–1143.

[Chyzhyk et al., 2014] Chyzhyk, D., Savio, A., and Grana, M. (2014). Evolutionary ELM

wrapper feature selection for Alzheimer’s disease CAD on anatomical brain MRI. Neuro-

computing, 128:73–80.

[Cocosco et al., 1997] Cocosco, C. A., Kollokian, V., Kwan, R. K.-S., Pike, G. B., and Evans,

A. C. (1997). BrainWeb: Online Interface to a 3D MRI Simulated Brain Database.

NeuroImage, 5:425.

[Cortes et al., 2012] Cortes, C., Mohri, M., and Rostamizadeh, A. (2012). Algorithms for

learning kernels based on centered alignment. The Journal of Machine Learning Research, 13:795–828.

[Coupe et al., 2011] Coupe, P., Manjon, J. V., Fonov, V., Pruessner, J., Robles, M., and

Collins, D. L. (2011). Patch-based segmentation using expert priors: Application to

hippocampus and ventricle segmentation. NeuroImage, 54(2):940–954.

[Cuingnet et al., 2011] Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehericy, S.,

Habert, M.-O., Chupin, M., Benali, H., and Colliot, O. (2011). Automatic classification

of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods

using the ADNI database. NeuroImage, 56(2):766–81.

[Darvas et al., 2004] Darvas, F., Pantazis, D., Kucukaltun-Yildirim, E., and Leahy, R. M.

(2004). Mapping human brain function with MEG and EEG: methods and validation.

NeuroImage, 23 Suppl 1:S289–99.

[Davatzikos, 2004] Davatzikos, C. (2004). Why voxel-based morphometric analysis should be

used with great caution when characterizing group differences. NeuroImage, 23(1):17–20.

[De et al., 2011] De, A., Bhattacharjee, A. K., Chanda, C. K., and Maji, B. (2011). MRI

segmentation using Entropy maximization and Hybrid Particle Swarm Optimization with

Wavelet Mutation. 2011 World Congress on Information and Communication Technolo-

gies, pages 362–367.

[de Munck et al., 1988] de Munck, J. C., van Dijk, B. W., and Spekreijse, H. (1988). Mathe-

matical dipoles are adequate to describe realistic generators of human brain activity. IEEE

transactions on bio-medical engineering, 35(11):960–6.

[Demirhan and Guler, 2011] Demirhan, A. and Guler, A. (2011). Combining stationary

wavelet transform and self-organizing maps for brain MR image segmentation. Engineering

Applications of Artificial Intelligence, 24(2):358–367.

[Dubois et al., 2014] Dubois, B., Feldman, H. H., Jacova, C., Hampel, H., Molinuevo, J. L.,

Blennow, K., DeKosky, S. T., Gauthier, S., Selkoe, D., Bateman, R., Cappa, S., Crutch,

S., Engelborghs, S., Frisoni, G. B., Fox, N. C., Galasko, D., Habert, M.-O., Jicha, G. A.,

Nordberg, A., Pasquier, F., Rabinovici, G., Robert, P., Rowe, C., Salloway, S., Sarazin,

M., Epelbaum, S., de Souza, L. C., Vellas, B., Visser, P. J., Schneider, L., Stern, Y.,

Scheltens, P., and Cummings, J. L. (2014). Advancing research diagnostic criteria for

Alzheimer’s disease: the IWG-2 criteria. Lancet neurology, 13(6):614–29.

[Eskildsen et al., 2015] Eskildsen, S. F., Coupe, P., Fonov, V. S., Pruessner, J. C., and

Collins, D. L. (2015). Structural imaging biomarkers of Alzheimer’s disease: predicting

disease progression. Neurobiology of Aging, 36:S23–S31.

[Falahati et al., 2014] Falahati, F., Westman, E., and Simmons, A. (2014). Multivariate data

analysis and machine learning in Alzheimer’s disease with a focus on structural magnetic

resonance imaging. Journal of Alzheimer’s disease : JAD, 41(3):685–708.

[Farhan et al., 2014] Farhan, S., Fahiem, M. A., and Tauseef, H. (2014). An ensemble-of-

classifiers based approach for early diagnosis of Alzheimer’s disease: classification using

structural features of brain images. Computational and mathematical methods in medicine,

2014:862307.

[Fischl, 2004] Fischl, B. (2004). Automatically Parcellating the Human Cerebral Cortex.

Cerebral Cortex, 14(1):11–22.

[Fischl, 2012] Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2):774–81.

[Fischl et al., 2002] Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove,

C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N.,

Rosen, B., and Dale, A. M. (2002). Whole brain segmentation: automated labeling of

neuroanatomical structures in the human brain. Neuron, 33(3):341–55.

[Folstein et al., 1975] Folstein, M. F., Folstein, S. E., and McHugh, P. R. (1975). “Mini-

mental state”: A practical method for grading the cognitive state of patients for the

clinician. Journal of Psychiatric Research, 12(3):189–198.

[Giraldo and Principe, 2013] Giraldo, L. G. S. and Principe, J. C. (2013). Information The-

oretic Learning with Infinitely Divisible Kernels.

[Grech et al., 2008] Grech, R., Cassar, T., Muscat, J., Camilleri, K. P., Fabri, S. G., Zer-

vakis, M., Xanthopoulos, P., Sakkalis, V., and Vanrumste, B. (2008). Review on solving

the inverse problem in EEG source analysis. Journal of neuroengineering and rehabilita-

tion, 5:25.

[Han et al., 2006] Han, X., Jovicich, J., Salat, D., van der Kouwe, A., Quinn, B., Czanner,

S., Busa, E., Pacheco, J., Albert, M., Killiany, R., Maguire, P., Rosas, D., Makris, N.,

Dale, A., Dickerson, B., and Fischl, B. (2006). Reliability of MRI-derived measurements

of human cerebral cortical thickness: the effects of field strength, scanner upgrade and

manufacturer. NeuroImage, 32(1):180–94.

[Heckemann et al., 2006] Heckemann, R. A., Hajnal, J. V., Aljabar, P., Rueckert, D., and

Hammers, A. (2006). Automatic anatomical brain MRI segmentation combining label

propagation and decision fusion. NeuroImage, 33(1):115–26.

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning

algorithm for deep belief nets. Neural computation, 18(7):1527–54.

[Iftekharuddin et al., 2009] Iftekharuddin, K. M., Zheng, J., Islam, M. A., and Ogg, R. J.

(2009). Fractal-based brain tumor detection in multimodal MRI. Applied Mathematics

and Computation, 207(1):23–41.

[Isgum et al., 2009] Isgum, I., Staring, M., Rutten, A., Prokop, M., Viergever, M. A., and

Van Ginneken, B. (2009). Multi-atlas-based segmentation with local decision fusion-

application to cardiac and aortic segmentation in CT scans. IEEE Transactions on Medical

Imaging, 28(7):1000–1010.

[Jack et al., 2013] Jack, C. R., Knopman, D. S., Jagust, W. J., Petersen, R. C., Weiner,

M. W., Aisen, P. S., Shaw, L. M., Vemuri, P., Wiste, H. J., Weigand, S. D., Lesnick,

T. G., Pankratz, V. S., Donohue, M. C., and Trojanowski, J. Q. (2013). Tracking patho-

physiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic

biomarkers. The Lancet. Neurology, 12(2):207–16.

[Jenssen et al., 2003] Jenssen, R., Principe, J., and Eltoft, T. (2003). Information cut and in-

formation forces for clustering. In Neural Networks for Signal Processing, 2003. NNSP’03.

2003 IEEE 13th Workshop on, pages 459–468.

[Jung et al., 2015] Jung, W. B., Lee, Y. M., Kim, Y. H., and Mun, C.-w. (2015). Automated

Classification to Predict the Progression of Alzheimer’s Disease Using Whole-Brain Vol-

umetry and DTI. Psychiatry Investigation, 12(1):92–102.

[Khedher et al., 2015] Khedher, L., Ramírez, J., Gorriz, J., Brahim, A., and Segovia, F.

(2015). Early diagnosis of Alzheimer’s disease based on partial least squares, principal

component analysis and support vector machine using segmented MRI images. Neuro-

computing, 151:139–150.

[Kimeldorf and Wahba, 1971] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of mathematical analysis and applications, 33(1):82–95.

[Kloppel et al., 2012a] Kloppel, S., Abdulkadir, A., Jack, C. R., Koutsouleris, N., Mourao-

Miranda, J., and Vemuri, P. (2012a). Diagnostic neuroimaging across diseases. NeuroIm-

age, 61(2):457–463.

[Kloppel et al., 2012b] Kloppel, S., Abdulkadir, A., Jack, C. R., Koutsouleris, N., Mourao-

Miranda, J., and Vemuri, P. (2012b). Diagnostic neuroimaging across diseases. NeuroIm-

age, 61(2):457–63.

[Kloppel et al., 2015] Kloppel, S., Peter, J., Ludl, A., Pilatus, A., Maier, S., Mader, I., Heim-

bach, B., Frings, L., Egger, K., Dukart, J., Schroeter, M. L., Perneczky, R., Haussermann,

P., Vach, W., Urbach, H., Teipel, S., Hull, M., and Abdulkadir, A. (2015). Applying

Automated MR-Based Diagnostic Methods to the Memory Clinic: A Prospective Study.

Journal of Alzheimer’s disease : JAD, 47(4):939–54.

[Kuklisova-Murgasova et al., 2011] Kuklisova-Murgasova, M., Aljabar, P., Srinivasan, L.,

Counsell, S. J., Doria, V., Serag, A., Gousias, I. S., Boardman, J. P., Rutherford, M. A.,

Edwards, A. D., Hajnal, J. V., and Rueckert, D. (2011). A dynamic 4D probabilistic atlas

of the developing brain. NeuroImage, 54(4):2750–2763.

[Kwan et al., 1996] Kwan, R. K.-S., Evans, A. C., and Pike, G. B. (1996). An extensible

MRI simulator for post-processing evaluation. In Hohne, K. H. and Kikinis, R., editors,

Visualization in Biomedical Computing, volume 1131 of Lecture Notes in Computer Sci-

ence, pages 135–140. Springer Berlin Heidelberg, Berlin, Heidelberg.

[Lanfer et al., 2012] Lanfer, B., Scherg, M., Dannhauer, M., Knösche, T. R., Burger, M.,

and Wolters, C. H. (2012). Influences of skull segmentation inaccuracies on EEG source

analysis. NeuroImage, 62(1):418–431.

[Lankton et al., 2007] Lankton, S., Nain, D., Yezzi, A., and Tannenbaum, A. (2007). Hybrid

geodesic region-based curve evolutions for image segmentation. In Cleary, K. R., Hsieh,

J., Manduca, A., Pluim, J. P. W., Horii, S. C., Emelianov, S. Y., Giger, M. L., Jiang, Y.,

Sahiner, B., Karssemeijer, N., McAleavey, S. A., Andriole, K. P., Reinhardt, J. M., Hu,

X. P., Flynn, M. J., and Miga, M. I., editors, Proc. SPIE 6510, Medical Imaging 2007:

Physics of Medical Imaging, volume 6510, pages 65104U–65104U–10.

[Lankton and Tannenbaum, 2008] Lankton, S. and Tannenbaum, A. (2008). Localizing

region-based active contours. IEEE transactions on image processing : a publication of

the IEEE Signal Processing Society, 17(11):2029–39.

[Ledig et al., 2012] Ledig, C., Wolz, R., Aljabar, P., Lotjonen, J., Heckemann, R. A.,

Hammers, A., and Rueckert, D. (2012). Multi-class brain segmentation using atlas propagation

and EM-based refinement. Proceedings - International Symposium on Biomedical Imaging,

pages 896–899.

[Lewis, 1996] Lewis, A. S. (1996). Derivatives of Spectral Functions. Mathematics of

Operations Research, 21(3):576–588.

[Li et al., 2007] Li, C., Kao, C.-Y., Gore, J. C., and Ding, Z. (2007). Implicit Active Con-

tours Driven by Local Binary Fitting Energy. 2007 IEEE Conference on Computer Vision

and Pattern Recognition, pages 1–7.

[Li et al., 2008] Li, M., Huang, T., and Zhu, G. (2008). Improved Fast Fuzzy C-Means

Algorithm for Medical MR Images Segmentation. 2008 Second International Conference

on Genetic and Evolutionary Computing, pages 285–288.

[Liew and Yan, 2006] Liew, A. W.-C. and Yan, H. (2006). Current Methods in the Auto-

matic Tissue Segmentation of 3D Magnetic Resonance Brain Images. Current Medical

Imaging Reviews, 2(1):91–103.

[Liu et al., 2011] Liu, W., Principe, J. C., and Haykin, S. (2011). Kernel Adaptive Filtering:

A Comprehensive Introduction, volume 57. John Wiley & Sons.

[Lotjonen et al., 2010] Lotjonen, J. M., Wolz, R., Koikkalainen, J. R., Thurfjell, L., Walde-

mar, G., Soininen, H., and Rueckert, D. (2010). Fast and robust multi-atlas segmentation

of brain magnetic resonance images. NeuroImage, 49(3):2352–65.

[Ma et al., 2014] Ma, G., Gao, Y., Wu, G., Wu, L., and Shen, D. (2014). Atlas-Guided

Multi-channel Forest Learning for Human Brain Labeling. In Menze, B., Langs, G.,

Montillo, A., Kelm, M., Muller, H., Zhang, S., Cai, W. T., and Metaxas, D., editors,

Medical Computer Vision: Algorithms for Big Data, volume 8848 of LNCS, pages 97–104.

Springer International Publishing.

[Magnin et al., 2009] Magnin, B., Mesrob, L., Kinkingnehun, S., Pelegrini-Issac, M., Col-

liot, O., Sarazin, M., Dubois, B., Lehericy, S., and Benali, H. (2009). Support vector

machine-based classification of Alzheimer’s disease from whole-brain anatomical MRI.

Neuroradiology, 51(2):73–83.

[Marcus et al., 2010] Marcus, D. S., Fotenos, A. F., Csernansky, J. G., Morris, J. C., and

Buckner, R. L. (2010). Open Access Series of Imaging Studies (OASIS): Longitudinal MRI

Data in Nondemented and Demented Older Adults. Journal of cognitive neuroscience,

22(12):2677–2684.

[McKhann et al., 2011] McKhann, G. M., Knopman, D. S., Chertkow, H., Hyman, B. T.,

Jack, C. R., Kawas, C. H., Klunk, W. E., Koroshetz, W. J., Manly, J. J., Mayeux, R.,

Mohs, R. C., Morris, J. C., Rossor, M. N., Scheltens, P., Carrillo, M. C., Thies, B.,

Weintraub, S., and Phelps, C. H. (2011). The diagnosis of dementia due to Alzheimer’s

disease: recommendations from the National Institute on Aging-Alzheimer’s Association

workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & dementia :

the journal of the Alzheimer’s Association, 7(3):263–9.

[Mohamed et al., 2011] Mohamed, A. R., Sainath, T. N., Dahl, G., Ramabhadran, B.,

Hinton, G. E., and Picheny, M. A. (2011). Deep belief networks using discriminative features

for phone recognition. ICASSP, IEEE International Conference on Acoustics, Speech and

Signal Processing - Proceedings, pages 5060–5063.

[Montes-Restrepo et al., 2014] Montes-Restrepo, V., Van Mierlo, P., Strobbe, G., Staelens,

S., Vandenberghe, S., and Hallez, H. (2014). Influence of skull modeling approaches on

EEG source localization. Brain Topography, 27:95–111.

[Moradi et al., 2015] Moradi, E., Pepe, A., Gaser, C., Huttunen, H., and Tohka, J. (2015).

Machine learning framework for early MRI-based Alzheimer’s conversion prediction in

MCI subjects. NeuroImage, 104:398–412.

[Morejon and Principe, 2004] Morejon, R. and Principe, J. (2004). Advanced search algo-

rithms for information-theoretic learning with kernel-based estimators. Neural Networks,

IEEE Transactions on, 15(4):874–884.

[Ota et al., 2014] Ota, K., Oishi, N., Ito, K., and Fukuyama, H. (2014). A comparison of

three brain atlases for MCI prediction. Journal of Neuroscience Methods, 221:139–150.

[Ota et al., 2015] Ota, K., Oishi, N., Ito, K., and Fukuyama, H. (2015). Effects of imaging

modalities, brain atlases and feature selection on prediction of Alzheimer’s disease. Journal

of neuroscience methods, 256:168–83.

[Papakostas et al., 2015] Papakostas, G., Savio, A., Grana, M., and Kaburlasos, V. (2015).

A lattice computing approach to Alzheimer’s disease computer assisted diagnosis based

on MRI data. Neurocomputing, 150:37–42.

[Paragios and Deriche, 2002] Paragios, N. and Deriche, R. (2002). Geodesic Active Regions

and Level Set Methods for Supervised Texture Segmentation. International Journal of

Computer Vision, 46(3):223–247.

[Principe, 2010] Principe, J. C. (2010). Information theoretic learning: Renyi’s entropy and

kernel perspectives. Springer.

[Ramırez et al., 2013] Ramırez, J., Gorriz, J., Salas-Gonzalez, D., Romero, A., Lopez, M.,

Alvarez, I., and Gomez-Rıo, M. (2013). Computer-aided diagnosis of Alzheimer’s type

dementia combining support vector machines and discriminant set of features. Information

Sciences, 237:59–72.

[Ramırez et al., 2016] Ramırez, J., Gorriz, J. M., Ortiz, A., Padilla, P., Martınez-Murcia,

F. J., and Neuroimaging, D. (2016). Ensemble Tree Learning Techniques for Magnetic

Resonance Image Analysis. In Innovation in Medicine and Healthcare 2015, volume 45,

pages 395–404. Springer International Publishing.

[Ranzato et al., 2007] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007).

Efficient Learning of Sparse Representations with an Energy-Based Model. Advances in

Neural Information Processing Systems, pages 1137–1144.

[Rousseau et al., 2011] Rousseau, F., Habas, P. A., and Studholme, C. (2011). A supervised

patch-based approach for human brain labeling. IEEE Transactions on Medical Imaging,

30(10):1852–1862.

[Roy et al., 2014] Roy, S., Carass, A., Prince, J. L., and Pham, D. L. (2014). Subject Specific

Sparse Dictionary Learning for Atlas based Brain MRI Segmentation. Mach Learn Med

Imaging, 8679:248–255.

[Sabuncu and Konukoglu, 2015] Sabuncu, M. R. and Konukoglu, E. (2015). Clinical predic-

tion from structural brain MRI scans: a large-scale empirical study. Neuroinformatics,

13(1):31–46.

[Scholkopf and Smola, 2002] Scholkopf, B. and Smola, A. J. (2002). Learning with Kernels.

The MIT Press, Cambridge, MA, USA.

[Segonne et al., 2004] Segonne, F., Dale, A. M., Busa, E., Glessner, M., Salat, D., Hahn,

H. K., and Fischl, B. (2004). A hybrid approach to the skull stripping problem in MRI.

NeuroImage, 22(3):1060–75.

[Segonne et al., 2007] Segonne, F., Pacheco, J., and Fischl, B. (2007). Geometrically accu-

rate topology-correction of cortical surfaces using nonseparating loops. IEEE Transactions

on Medical Imaging, 26(4):518–529.

[Sikka et al., 2009] Sikka, K., Sinha, N., Singh, P. K., and Mishra, A. K. (2009). A fully

automated algorithm under modified FCM framework for improved brain MR image seg-

mentation. Magnetic resonance imaging, 27(7):994–1004.

[Sled et al., 1998] Sled, J. G., Zijdenbos, A. P., and Evans, A. C. (1998). A nonparametric

method for automatic correction of intensity nonuniformity in MRI data. IEEE transac-

tions on medical imaging, 17(1):87–97.

[Sørensen et al., 2013] Sørensen, L., Pai, A., Igel, C., and Nielsen, M. (2013). Hippocampal

texture predicts conversion from MCI to Alzheimer’s disease. Alzheimer’s & Dementia,

9(4):P581.

[Steen et al., 2000] Steen, R. G., Reddick, W. E., and Ogg, R. J. (2000). More than meets

the eye: significant regional heterogeneity in human cortical T1. Magnetic resonance

imaging, 18(4):361–8.

[Sum and Cheung, 2008] Sum, K. W. and Cheung, P. Y. S. (2008). Vessel Extraction Under

Non-Uniform Illumination: A Level Set Approach. In IEEE transactions on bio-medical

engineering, volume 55, pages 358–360.

[Tong et al., 2013] Tong, T., Wolz, R., Coupe, P., Hajnal, J. V., and Rueckert, D. (2013).

Segmentation of MR images via discriminative dictionary learning and sparse coding:

Application to hippocampus labeling. NeuroImage, 76:11–23.

[Tong et al., 2015] Tong, T., Wolz, R., Wang, Z., Gao, Q., Misawa, K., Fujiwara, M., Mori,

K., Hajnal, J. V., and Rueckert, D. (2015). Discriminative Dictionary Learning for Ab-

dominal Multi-Organ Segmentation. Medical Image Analysis, 23:92–104.

[Tu and Bai, 2010] Tu, Z. and Bai, X. (2010). Auto-context and its application to high-level

vision tasks and 3D brain image segmentation. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 32(10):1744–1757.

[Valdes-Hernandez et al., 2009] Valdes-Hernandez, P. A., von Ellenrieder, N., Ojeda-

Gonzalez, A., Kochen, S., Aleman-Gomez, Y., Muravchik, C., and Valdes-Sosa, P. A.

(2009). Approximate average head models for EEG source imaging. Journal of neuro-

science methods, 185:125–32.

[Vincent et al., 2010] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol,

P.-A. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a

Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research,

11(3):3371–3408.

[Vovk et al., 2007] Vovk, U., Pernus, F., and Likar, B. (2007). A review of methods for

correction of intensity inhomogeneity in MRI. IEEE transactions on medical imaging,

26(3):405–21.

[Wang and Wang, 2008] Wang, P. and Wang, H. (2008). A Modified FCM Algorithm for

MRI Brain Image Segmentation. 2008 International Seminar on Future BioMedical In-

formation Engineering, 32(8):685–698.

[Wang et al., 2010] Wang, X.-F., Huang, D.-S., and Xu, H. (2010). An efficient local Chan-

Vese model for image segmentation. Pattern Recognition, 43(3):603–618.

[Westman et al., 2013] Westman, E., Aguilar, C., Muehlboeck, J. S., and Simmons, A.

(2013). Regional magnetic resonance imaging measures for multivariate analysis in

Alzheimer’s disease and mild cognitive impairment. Brain Topography, 26(1):9–23.

[Weston et al., 2012] Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep

learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, volume

7700 of Lecture Notes in Computer Science, pages 639–655. Springer Berlin Heidelberg.

[Wolz et al., 2011] Wolz, R., Julkunen, V., Koikkalainen, J., Niskanen, E., Zhang, D. P.,

Rueckert, D., Soininen, H., and Lotjonen, J. (2011). Multi-method analysis of MRI images

in early diagnostics of Alzheimer’s disease. PLoS ONE, 6(10):1–9.

[Wu et al., 2014] Wu, G., Wang, Q., Zhang, D., Nie, F., Huang, H., and Shen, D. (2014). A

generative probability model of joint label fusion for multi-atlas based brain segmentation.

Medical image analysis, 18(6):881–90.

[Zhang et al., 2012] Zhang, D., Guo, Q., Wu, G., and Shen, D. (2012). Sparse Patch-Based

Label Fusion for Multi-Atlas Segmentation. In Yap, P.-T., Liu, T., Shen, D., Westin,

C.-F., and Shen, L., editors, Multimodal Brain Image Analysis, volume 7509 of LNCS,

pages 94–102. Springer Berlin Heidelberg.

[Zhang et al., 2011] Zhang, D., Wang, Y., Zhou, L., Yuan, H., and Shen, D. (2011). Multi-

modal classification of Alzheimer’s disease and mild cognitive impairment. NeuroImage,

55(3):856–67.

[Zitova and Flusser, 2003] Zitova, B. and Flusser, J. (2003). Image registration methods: A

survey. Image and Vision Computing, 21(11):977–1000.

Biographical sketch

David Cardenas-Pena received the bachelor’s degree in electronic engineering and the M.Eng. degree in industrial automation from the Universidad Nacional de Colombia, Manizales, Colombia, in 2008 and 2011, respectively. He completed his Ph.D. degree in automatics at the Universidad Nacional de Colombia in 2016. He has been a Research Assistant with the Signal Processing and Recognition Group since 2008 and a Teaching Assistant since 2012. His main research interests include machine learning and pattern recognition applied to biosignal and image processing.
