Eight-Week Report
Registration / Application No: ENGS2229S
Name: Amogha P
Institution: National Institute of Technology, Surathkal
Abstract:
Alzheimer's disease (AD) is one of the major health problems facing the world today. According to
a recent study, 1 out of 8 Americans suffers from Alzheimer's disease. The situation is worse in
developing countries such as India: according to a study conducted by Alzheimer's Disease
International, developing countries presently account for 58% of people with dementia, and this
share is projected to rise to 71% by 2050. It is therefore important to have a diagnostic method
by which AD can be identified. The only way to confirm AD is autopsy of the brain; hence, in vivo
methods to identify AD are needed. The present study aims to identify AD with the help of a
support vector machine. Axial images of 302 controls, 54 MCI patients and 62 AD patients in which
hippocampal atrophy was visible were taken. These images were segmented into four regions (grey
matter, white matter, skull, and voids, i.e. atrophy and background) by an improved fuzzy c-means
method, and the segmented images were classified using a support vector machine, achieving an
accuracy of 79.18%. A second classifier, a hidden neural network, was therefore studied, and with
it an accuracy of 85.56% was achieved. This is a three-class pattern classification problem, for
which this accuracy is very good.
Table of Contents
1. Introduction
2. Methodology and Discussion
a. Preprocessing
b. Image Segmentation
c. Training
d. Testing
3. Results
a. Support Vector Machine
b. Hidden Neural Network
4. Conclusion
5. References
Introduction
Alzheimer's disease is one of the most common neurodegenerative diseases[1]. However, the
disease can be confirmed only by autopsy of the brain. Hence, there is a need for an in vivo method
for the diagnosis of Alzheimer's disease[1,2]. The methods presently used include MRI, MRS, PET
and CT scans. However, diagnosis is based on an atrophy rating or on visual features observed by
an expert. This leads to bias, as each expert may rate atrophy differently. Hence, there is a need
for an automated method to diagnose Alzheimer's disease. The automated method that I am trying to
implement identifies Alzheimer's disease from the MRI image of the test subject. This is a pattern
recognition problem. Many classifiers are available in the literature. The classifier I have used
is the support vector machine, because in this study the sample size (number of test subjects on
whom MRI was done = 400) is much smaller than the dimensionality (pixels in the image = 256*256).
Being a classic pattern recognition problem, it naturally involves these steps[3]:
1. Preprocessing of the MRI image
2. Segmentation of Image
3. Training the classifier on the MRI images
4. Validation
Preprocessing is done to remove noise from the image and to improve its contrast. The
information in the image must be usable for SVM-based classification; hence, segmentation into
the different anatomical parts of the brain is used to extract features for the SVM. These
segmented images are used to train the SVM to distinguish between AD and control subjects.
Validation is done to test the accuracy of the SVM.
Methodology
1. Preprocessing
The MRI images used were from www.oasis-brains.org; this website has MRI images of 416
subjects. The subjects were also given the MMSE (Mini Mental State Exam), and the CDR (Clinical
Dementia Rating) was obtained. A CDR of 1 and above indicates probable Alzheimer's disease. The
images were processed using the standard procedure of applying the inverse Fourier transform
followed by connected component analysis. For my study, I chose the coronal image of the brain,
as medial temporal atrophy can be identified from this image. Histogram equalization was applied
to enhance the contrast of the image. However, adaptive histogram equalization was found to
outperform normal histogram equalization.
Histogram Equalization:
Histogram equalization is a common image contrast enhancement method. Consider a
discrete grayscale image {x} and let $n_i$ be the number of occurrences of gray level i. The
probability of an occurrence of a pixel of level i in the image is
$$p_x(i) = p(x = i) = \frac{n_i}{n}, \quad 0 \le i < L$$
L being the total number of gray levels in the image, n being the total number of pixels in the
image, and $p_x(i)$ being in fact the image's histogram for pixel value i, normalized to [0,1].
The cumulative distribution function corresponding to $p_x$ is
$$\mathrm{cdf}_x(i) = \sum_{j=0}^{i} p_x(j)$$
which is also the image's accumulated normalized histogram.
We seek a transformation of the form y = T(x) that produces a new image {y} whose CDF is
linearized across the value range, i.e.
$$\mathrm{cdf}_y(i) = iK$$
for some constant K. The properties of the CDF allow us to perform such a transform; it is
defined as
$$y = T(x) = \mathrm{cdf}_x(x)$$
In order to map the values back into their original range, the following simple transformation
needs to be applied to the result[4]:
$$y' = y \cdot (\max\{x\} - \min\{x\}) + \min\{x\}$$
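The equalization steps described above can be sketched in plain Python. This is only a minimal illustration, assuming an image stored as a list of rows of integer gray levels; the function name `equalize_histogram` and the 3x3 sample image are hypothetical and are not taken from the study.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a 2-D list of integer gray levels."""
    pixels = [v for row in image for v in row]
    n = len(pixels)

    # Normalized histogram p_x(i) = n_i / n.
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    p = [count / n for count in hist]

    # Cumulative distribution cdf_x(i) = sum of p_x(j) for j <= i.
    cdf = []
    running = 0.0
    for value in p:
        running += value
        cdf.append(running)

    # y = T(x) = cdf_x(x), then map back to the original range [lo, hi].
    lo, hi = min(pixels), max(pixels)
    return [[round(cdf[v] * (hi - lo) + lo) for v in row] for row in image]


# Hypothetical 3x3 image used only to exercise the function.
image = [[52, 55, 61], [59, 79, 61], [85, 61, 52]]
equalized = equalize_histogram(image)
```

Because the mapping is built from a cumulative sum, it is monotonic: pixels that were equal stay equal and the ordering of gray levels is preserved.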
Adaptive Histogram(CLAHE):
CLAHE differs from ordinary adaptive histogram equalization in its contrast limiting. This
feature can also be applied to global histogram equalization, giving rise to contrast-limited
histogram equalization (CLHE), which is rarely used in practice. In the case of CLAHE, the
contrast limiting procedure has to be applied for each neighborhood from which a transformation
function is derived. CLAHE was developed to prevent the overamplification of noise that
adaptive histogram equalization can give rise to.
This is achieved by limiting the contrast enhancement of AHE. The contrast amplification in the
vicinity of a given pixel value is given by the slope of the transformation function. This is
proportional to the slope of the neighbourhood cumulative distribution function (CDF) and
therefore to the value of the histogram at that pixel value. CLAHE limits the amplification by
clipping the histogram at a predefined value before computing the CDF. This limits the slope of
the CDF and therefore of the transformation function. The value at which the histogram is
clipped, the so-called clip limit, depends on the normalization of the histogram and thereby on
the size of the neighbourhood region. Common values limit the resulting amplification to
between 3 and 4. It is advantageous not to discard the part of the histogram that exceeds the clip
limit but to redistribute it equally among all histogram bins. The redistribution will push some
bins over the clip limit again (region shaded green in the figure), resulting in an effective clip
limit that is larger than the prescribed limit and the exact value of which depends on the image. If
this is undesirable, the redistribution procedure can be repeated recursively until the excess is
negligible[5].
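The clip-and-redistribute step described above can be illustrated on a single neighborhood histogram. The sketch below is an assumption-laden simplification: it works on a plain list of bin counts, the name `clip_histogram` and the `max_passes` parameter are hypothetical, and it covers only the clipping stage, not the tiling and interpolation of a full CLAHE implementation.

```python
def clip_histogram(hist, clip_limit, max_passes=10):
    """Clip a histogram at clip_limit and spread the excess equally over all bins.

    The redistribution is repeated because it can push bins over the limit again.
    """
    hist = [float(h) for h in hist]
    for _ in range(max_passes):
        excess = sum(max(h - clip_limit, 0.0) for h in hist)
        if excess < 1e-9:
            break
        # Cut every bin at the clip limit, then share the excess equally.
        share = excess / len(hist)
        hist = [min(h, clip_limit) + share for h in hist]
    return hist


# Hypothetical 4-bin histogram: one pass leaves the first bin above the limit,
# repeated passes bring it arbitrarily close to the limit.
one_pass = clip_histogram([10, 2, 0, 0], clip_limit=4, max_passes=1)
repeated = clip_histogram([10, 2, 0, 0], clip_limit=4)
```

The total count is unchanged by each pass, so the clipped histogram still describes the same number of pixels; only the slope of the resulting CDF is limited.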
When normal histogram equalization was applied to an image with larger voids, the equalization
dulled the white matter intensity, which could cause problems in segmentation. Hence, adaptive
histogram equalization was used. Fig. 1(a) and Fig. 1(b) show the results of both methods;
adaptive enhancement performs better and was therefore applied.
Fig. 1: Preprocessed images. 1(a) The image equalized by normal histogram equalization. 1(b) The
image enhanced by adaptive histogram equalization.
Noise Removal:
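After image contrast adjustment, the noise gets boosted; hence, noise removal is important for
proper segmentation. Noise removal was done with an adaptive Wiener filter. The Wiener filter
estimates the local mean and variance around each pixel[6]:
$$\mu = \frac{1}{NM} \sum_{n_1, n_2 \in \eta} a(n_1, n_2)$$
$$\sigma^2 = \frac{1}{NM} \sum_{n_1, n_2 \in \eta} a^2(n_1, n_2) - \mu^2$$
where $\eta$ is the N-by-M local neighborhood of each pixel in the image A. A pixelwise Wiener
filter is then created using these estimates,
$$b(n_1, n_2) = \mu + \frac{\sigma^2 - \nu^2}{\sigma^2} \left( a(n_1, n_2) - \mu \right)$$
where $\nu^2$ is the noise variance. The noise variance is taken as the average of all the local
estimated variances.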
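A pixelwise adaptive Wiener filter of this kind can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: the function name `wiener_filter` is hypothetical, the neighborhood is truncated at the image border, and the gain is clamped at zero where the local variance falls below the noise estimate (both are assumptions made here).

```python
def wiener_filter(image, size=3):
    """Pixelwise adaptive Wiener filter on a 2-D list of floats."""
    rows, cols = len(image), len(image[0])
    half = size // 2
    mean = [[0.0] * cols for _ in range(rows)]
    var = [[0.0] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            # Local neighborhood, truncated at the image border (a simplification).
            window = [image[i][j]
                      for i in range(max(r - half, 0), min(r + half + 1, rows))
                      for j in range(max(c - half, 0), min(c + half + 1, cols))]
            mu = sum(window) / len(window)
            mean[r][c] = mu
            var[r][c] = sum(v * v for v in window) / len(window) - mu * mu

    # Noise variance: average of all the local variance estimates.
    noise = sum(sum(row) for row in var) / (rows * cols)

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Gain (sigma^2 - nu^2) / sigma^2, clamped at zero in flat regions.
            if var[r][c] > 0:
                gain = max(var[r][c] - noise, 0.0) / var[r][c]
            else:
                gain = 0.0
            out[r][c] = mean[r][c] + gain * (image[r][c] - mean[r][c])
    return out


# Hypothetical test image: a single bright spike on a dark background.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 9.0
smoothed = wiener_filter(spike)
```

In flat regions the local variance is small, so the output is close to the local mean; near strong structure the variance is large and the pixel is left mostly unchanged.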
2. Segmentation of Image
The FCM algorithm assigns pixels to each category by using fuzzy memberships. Let
$X = (x_1, x_2, \ldots, x_N)$ denote an image with N pixels to be partitioned into c clusters,
where $x_j$ represents multispectral (features) data. The algorithm is an iterative optimization
that minimizes the cost function defined as follows:
$$J = \sum_{j=1}^{N} \sum_{i=1}^{c} u_{ij}^{m} \, \lVert x_j - v_i \rVert^2$$
where $u_{ij}$ represents the membership of pixel $x_j$ in the ith cluster, $v_i$ is the ith
cluster center, $\lVert \cdot \rVert$ is a norm metric, and m is a constant. The parameter m
controls the fuzziness of the resulting partition, and m=2 is used in this study.
The cost function is minimized when pixels close to the centroid of their clusters are assigned
high membership values, and low membership values are assigned to pixels with data far from
the centroid. The membership function represents the probability that a pixel belongs to a
specific cluster. In the FCM algorithm, the probability is dependent solely on the distance
between the pixel and each individual cluster center in the feature domain. The membership
functions and cluster centers are updated by the following[7]:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\lVert x_j - v_i \rVert}{\lVert x_j - v_k \rVert} \right)^{2/(m-1)}}$$
$$v_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} x_j}{\sum_{j=1}^{N} u_{ij}^{m}}$$
Starting with an initial guess for each cluster center, the FCM converges to a solution for vi
representing the local minimum or a saddle point of the cost function. Convergence can be
detected by comparing the changes in the membership function or the cluster center at two
successive iteration steps.
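The alternating membership and center updates of standard FCM can be sketched on scalar intensities. The following is a minimal illustration under stated assumptions: pixels are reduced to a flat list of gray values, the initial centers are supplied by the caller, and the names `fcm`, `tol` and `max_iter` are hypothetical.

```python
def fcm(data, centers, m=2.0, tol=1e-4, max_iter=100):
    """Fuzzy c-means on a list of scalar features with given initial centers.

    Returns the final centers and the memberships u[i][j] (cluster i, pixel j)
    computed in the last pass.
    """
    centers = list(centers)
    exponent = 2.0 / (m - 1.0)
    u = []
    for _ in range(max_iter):
        # Membership update from the distances to every cluster center.
        u = [[0.0] * len(data) for _ in centers]
        for j, x in enumerate(data):
            dists = [abs(x - v) for v in centers]
            zeros = dists.count(0.0)
            for i, d_i in enumerate(dists):
                if zeros:
                    # Pixel coincides with a center: share full membership there.
                    u[i][j] = 1.0 / zeros if d_i == 0.0 else 0.0
                else:
                    u[i][j] = 1.0 / sum((d_i / d_k) ** exponent for d_k in dists)

        # Center update: mean of the data weighted by u^m.
        new_centers = []
        for i in range(len(centers)):
            weights = [u[i][j] ** m for j in range(len(data))]
            new_centers.append(sum(w * x for w, x in zip(weights, data)) / sum(weights))

        # Convergence check on the change of the cluster centers.
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u


# Hypothetical intensities forming a dark group and a bright group.
values = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
final_centers, memberships = fcm(values, centers=[0.3, 0.7])
```

With m=2 the exponent 2/(m-1) equals 2, so memberships fall off with the squared distance ratio, which matches the setting used in this study.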
However, this method treats the whole image as a column vector when segmenting it. In an MRI,
it is known that all white matter forms a connected set, as does all grey matter. Hence, a small
change was made to the algorithm to account for this fact: the membership matrix is convolved
with a 5x5 matrix of ones, and the update rule is modified so that the memberships of the
neighboring pixels are added to that of the given pixel, which encourages connectivity[8].
This spatial relationship is important in clustering, but it is not utilized in a standard FCM
algorithm. To exploit the spatial information, a spatial function is defined as
$$h_{ij} = \sum_{k \in NB(x_j)} u_{ik}$$
where $NB(x_j)$ represents a square window centered on pixel $x_j$ in the spatial domain. A 5*5
window was used throughout this work. Just like the membership function, the spatial function
$h_{ij}$ represents the probability that pixel $x_j$ belongs to the ith cluster. The spatial
function of a pixel for a cluster is large if the majority of its neighborhood belongs to the
same cluster. The spatial function is incorporated into the membership function as follows:
$$u'_{ij} = \frac{u_{ij}^{p} \, h_{ij}^{q}}{\sum_{k=1}^{c} u_{kj}^{p} \, h_{kj}^{q}}$$
where p and q are parameters to control the relative importance of both functions. In a
homogenous region, the spatial functions simply fortify the original membership, and the
clustering result remains unchanged. However, for a noisy pixel, this formula reduces the
weighting of a noisy cluster by the labels of its neighboring pixels. As a result, misclassified
pixels from noisy regions or spurious blobs can easily be corrected. The spatial FCM with
parameter p and q is denoted sFCMp,q. Note that sFCM1,0 is identical to the conventional FCM.
The clustering is a two-pass process at each iteration. The first pass is the same as that in
standard FCM to calculate the membership function in the spectral domain. In the second pass,
the membership information of each pixel is mapped to the spatial domain, and the spatial
function is computed from that. The FCM iteration proceeds with the new membership that is
incorporated with the spatial function. The iteration is stopped when the maximum difference
between two cluster centers at two successive iterations is less than a threshold (= 0.02). After the
convergence, defuzzification is applied to assign each pixel to a specific cluster for which the
membership is maximal[8].
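The second pass of each iteration, in which memberships are combined with their windowed neighborhood sums, can be sketched as below. This is an illustrative simplification: memberships are stored as nested lists indexed as u[cluster][row][column], the window is truncated at the image border, and the function name `spatial_membership` is hypothetical.

```python
def spatial_membership(u, p=1, q=1, size=5):
    """Combine memberships u[i][r][c] with the spatial function over a square window."""
    clusters, rows, cols = len(u), len(u[0]), len(u[0][0])
    half = size // 2

    # h[i][r][c]: sum of cluster-i memberships over the window centered on (r, c).
    h = [[[sum(u[i][a][b]
               for a in range(max(r - half, 0), min(r + half + 1, rows))
               for b in range(max(c - half, 0), min(c + half + 1, cols)))
           for c in range(cols)] for r in range(rows)] for i in range(clusters)]

    # New membership: u^p * h^q, normalized over the clusters at each pixel.
    out = [[[0.0] * cols for _ in range(rows)] for _ in range(clusters)]
    for r in range(rows):
        for c in range(cols):
            terms = [u[i][r][c] ** p * h[i][r][c] ** q for i in range(clusters)]
            total = sum(terms)
            for i in range(clusters):
                out[i][r][c] = terms[i] / total
    return out


# Hypothetical 3x3 membership maps: the center pixel disagrees with its neighbors.
u0 = [[0.9, 0.9, 0.9], [0.9, 0.2, 0.9], [0.9, 0.9, 0.9]]
u1 = [[1.0 - v for v in row] for row in u0]
corrected = spatial_membership([u0, u1], p=1, q=1, size=3)
unchanged = spatial_membership([u0, u1], p=1, q=0, size=3)
```

In this toy case the noisy center pixel moves from a membership of 0.2 to above 0.5 for the first cluster, while setting q=0 leaves the memberships as they were, consistent with sFCM1,0 being the conventional FCM.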
Fig. 2: 2(a) Original image. 2(b) Segmented by fuzzy c-means. 2(c) Segmented by the improved
fuzzy c-means algorithm. The arrows in the third image show the atrophy (medial temporal atrophy).
Thus the second segmentation method outperforms the first one.
3. Classifier Training:
1. Support Vector machine:
Support vector machines are among the most common pattern recognition methods used in
machine learning. The SVM is considered a successor to logistic regression because it is
based on maximum class separability.
Mathematical Basis of Support Vector Machine:
a. An Intuition:
A support vector machine aims at maximizing the distance between the vectors of the two classes
and the discriminatory function. Hence the method is less prone to errors in the testing phase
than methods such as logistic regression, which just maximizes the expectation of the probability
density function. Thus, support vector machines are called margin maximizers[9].
b. Mathematical definition
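Problem:
Given l observations, each consisting of a vector $\mathbf{x}_i \in \mathbb{R}^n$ and a label
$y_i \in \{-1, +1\}$; for a given vector $\mathbf{x}_i$, the label $y_i$ is supplied by what is
called a trusted source. We need to find the probability distribution $P(\mathbf{x}, y)$,
assuming that the data is IID (independently drawn and identically distributed).
Solution:
Consider a machine, or simply a function, $f(\mathbf{x}, \alpha)$, where α is the training parameter.
Expectation of the test error:
$$R(\alpha) = \int \tfrac{1}{2} \left| y - f(\mathbf{x}, \alpha) \right| \, dP(\mathbf{x}, y)$$
where $dP(\mathbf{x}, y) = p(\mathbf{x}, y) \, d\mathbf{x} \, dy$ when a density p exists.
R(α) cannot be calculated, as both $P(\mathbf{x}, y)$ and $p(\mathbf{x}, y)$ are not known, so a
bound is to be found such that the limit of the bound approximates R(α).
Consider the empirical risk,
$$R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} \left| y_i - f(\mathbf{x}_i, \alpha) \right|$$
Now it can be proved that, with probability $1 - \eta$,
$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h \left( \log(2l/h) + 1 \right) - \log(\eta/4)}{l}}$$
We need to find h such that the RHS is almost equal to the LHS; then R(α) can be estimated. h is
called the VC dimension, which is an integer; thus a perfect fit cannot be described, but a best
fit is described by changing the VC dimension.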
Linear SVM:
It is one of the most important models used in neuro applications, as the sample size is smaller
than the dimensionality of the system in many cases.
Mathematical Basis:
If $\mathbf{w}$ is the normal to the hyperplane and b its offset, then since the hyperplane or
discriminatory function is linear, we must have
$$\mathbf{w} \cdot \mathbf{x} + b = 0$$
The optimum $\mathbf{w}$ is decided by optimizing the "margin", i.e.,
$$y_i (\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \text{for all } i$$
Thus the distance optimization gives rise to a margin of $2 / \lVert \mathbf{w} \rVert$; hence
$\lVert \mathbf{w} \rVert$ needs to be minimized to maximize the distance[10].
However, doing this optimization directly is difficult. Hence, convex optimization tools are used
to get the solution. One commonly followed approach is Lagrange multipliers.
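Lagrange Multipliers[11]:
Lagrange multipliers are used in optimization problems of the form
$$\min f(x, y)$$
subject to:
$$g(x, y) = 0$$
Consider a two-dimensional problem to be solved by Lagrange multipliers. $z = f(x, y)$ is a
surface in 3 dimensions. However, imposing $g(x, y) = 0$ restricts it to a constrained curve in
2 dimensions. Differentiating the constraint with respect to x, we get
$$\frac{\partial g}{\partial x} + \frac{\partial g}{\partial y} \frac{dy}{dx} = 0 \qquad (1)$$
Consider a two-dimensional vector T, the tangent to the curve,
$$T = \left( 1, \frac{dy}{dx} \right)$$
Consider also the two-dimensional gradient vector,
$$\nabla g = \left( \frac{\partial g}{\partial x}, \frac{\partial g}{\partial y} \right)$$
Equation 1 can be written as $\nabla g \cdot T = 0$. Also, we need to minimize f along the curve,
and this is possible only if $\nabla f \cdot T = 0$. This means that $\nabla f$ and $\nabla g$ are
parallel to each other, which can be equivalently written as
$$\nabla f + \lambda \nabla g = 0$$
Consider $F(x, y, \lambda) = f(x, y) + \lambda g(x, y)$; then
$$\frac{\partial F}{\partial x} = 0, \quad \frac{\partial F}{\partial y} = 0, \quad \frac{\partial F}{\partial \lambda} = 0$$
Solving the above three equations simultaneously, we get the value of the Lagrange multiplier λ;
substituting the value of λ into the same set of equations gives the values of x and y. The
point (x, y) is the constrained minimum point, and f(x, y) there is the minimum.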
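As a small worked illustration of the procedure (the functions below are chosen here for simplicity and do not come from the study), take the objective $f(x, y) = x^2 + y^2$ with the constraint $g(x, y) = x + y - 1 = 0$, and form $F = f + \lambda g$:

```latex
\begin{align*}
F(x, y, \lambda) &= x^2 + y^2 + \lambda (x + y - 1) \\
\frac{\partial F}{\partial x} &= 2x + \lambda = 0 \\
\frac{\partial F}{\partial y} &= 2y + \lambda = 0 \\
\frac{\partial F}{\partial \lambda} &= x + y - 1 = 0 \\
\Rightarrow \quad x = y &= -\frac{\lambda}{2}, \qquad -\lambda = 1
  \;\Rightarrow\; \lambda = -1, \quad x = y = \frac{1}{2}, \quad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \frac{1}{2}
\end{align*}
```

Here a two-variable problem already requires three simultaneous equations; they happen to be linear only because f is quadratic and g is linear.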
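Now, we can use the concept of Lagrange multipliers shown above to solve the problem of ||w||
minimization. Here, we write the equation in terms of the Lagrange multipliers αi, where
i=1,2,...l, with training vectors $\mathbf{x}_i$, labels $y_i$ and offset b:
$$L_P = \tfrac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{l} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{w} + b) + \sum_{i=1}^{l} \alpha_i$$
where $L_P$ is the Lagrangian and α is the Lagrange multiplier vector.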
The solution of this equation is obtained by quadratic programming. The points for which αi>0
are called support vectors. These points lie on the two hyperplanes H1 and H2. They are
important because, even if the points that do not lie on these hyperplanes (points that are far
away from the discriminatory function) are removed, the discriminatory function does not
change. However, if the points on the hyperplanes are removed, the discriminatory function
changes. Hence, in feature selection it is important to make sure that the feature matrix does
not exclude the support vectors[12].
The two dotted lines are the hyperplanes H1 and H2 on which the support vectors lie. The margin
maximization yields the support vectors and the classifier.
The method discussed above is theoretical, i.e. an analytical way of solving for the solution.
However, once the sample size becomes large, the equations become hard to solve: as seen from
the example, even a two-variable problem requires the simultaneous solution of 3 equations, and
there is no guarantee that the equations are linear. Hence, an iterative algorithmic method is
required to find the solution[13].
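One simple iterative scheme is subgradient descent on the regularized hinge loss, which approximates the margin-maximizing solution without solving the simultaneous equations. This is a swapped-in technique for illustration only; it is not the quadratic programming solver described above, and the names `train_linear_svm`, `predict`, `lam`, `lr` and the toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit w and b by subgradient descent on lam/2 * ||w||^2 + mean hinge loss."""
    dim = len(X[0])
    n = len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin: the hinge term contributes.
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical linearly separable toy data with labels -1 and +1.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

Only points with margin below 1 contribute to the update, which mirrors the observation that the discriminatory function is determined by the points on or near the hyperplanes H1 and H2.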
Dimensionality reduction:
To apply the SVM algorithm effectively, the dimensionality has to be reduced. The feature matrix
has high dimensionality if a whole image is considered. Therefore PCA (principal component
analysis) is applied to the data to reduce the dimension; it uses the covariance matrix of the
data and discards the components whose variance is lower than a threshold. LDA is then
applied to the data to align it for greater class separability[14].
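The PCA step can be sketched with the covariance matrix and power iteration. This is a minimal standard-library illustration, not the procedure used in the study: the helper names `covariance`, `top_components` and `project`, the variance threshold, and the toy data are assumptions, and the fixed starting vector would fail in the rare case that it is orthogonal to the leading component.

```python
import math


def covariance(data):
    """Return the sample covariance matrix and the column means of data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, means


def top_components(cov, k, iters=500):
    """Leading k (variance, direction) pairs by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # No variance left in the deflated matrix.
            v = [x / norm for x in w]
        variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        pairs.append((variance, v))
        # Deflate: remove the direction just found.
        cov = [[cov[a][b] - variance * v[a] * v[b] for b in range(d)] for a in range(d)]
    return pairs


def project(data, means, pairs, min_variance):
    """Project onto the components whose variance is at least min_variance."""
    kept = [v for variance, v in pairs if variance >= min_variance]
    return [[sum((row[j] - means[j]) * v[j] for j in range(len(v))) for v in kept]
            for row in data]


# Hypothetical 2-D data lying on a line, so one component carries all the variance.
data = [[0, 0], [1, 2], [2, 4], [3, 6]]
cov, means = covariance(data)
pairs = top_components(cov, k=2)
reduced = project(data, means, pairs, min_variance=1e-6)
```

In this toy case the second component has essentially zero variance and is dropped, so each two-dimensional sample is reduced to a single coordinate.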
2. Hidden Neural Network
HMMs are based on a number of assumptions that limit their classification abilities. Combining
the CHMM framework with neural networks can lead to a more flexible and powerful model for
classification. The basic idea of the HNN presented here is to replace the probability parameters
of the CHMM by state-specific multilayer perceptrons that take the observations as input. Thus,
in the HNN, it is possible to assign up to three networks to each state: (1) a match network
outputting the “probability” that the current observation matches a given state, (2) a transition
network that outputs transition “probabilities” dependent on observations, and (3) a label
network that outputs the probability of the different labels in this state[15].
HNNs give very good results for problems of any size. In this problem, we have data from 416
subjects. 60% of the data was used for training, selected with a random seed, 15% for
validation and 25% for testing. The random selection was repeated multiple times to reduce bias
in the result, and the median of the results was taken as the final result.
To validate the results, the number of hidden neurons was varied as 10, 15 and 20, and the best
neural network was chosen. The network was fitted by a descent algorithm, subject to iteration
and minimum-difference constraints. The resulting neural network fits well, and hence the test
results are very good.
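The seeded 60/15/25 split and the median over repeated runs can be sketched as follows. This is a hedged illustration: `split_indices` and `median_accuracy` are hypothetical names, the `evaluate` callback stands in for training and testing a network, and the accuracy values fed to it in the usage line are placeholders rather than results of the study.

```python
import random
import statistics


def split_indices(n, seed, train=0.60, validation=0.15):
    """Shuffle 0..n-1 with a seed and cut into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train)
    n_val = int(n * validation)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def median_accuracy(evaluate, n, seeds):
    """Median of evaluate(train, validation, test) over several seeded splits."""
    return statistics.median(evaluate(*split_indices(n, s)) for s in seeds)


# Split for 416 subjects; the remainder after training and validation is the test set.
train_idx, val_idx, test_idx = split_indices(416, seed=0)

# Placeholder accuracies standing in for three trained models.
placeholder = iter([0.90, 0.80, 0.85])
result = median_accuracy(lambda tr, va, te: next(placeholder), 416, seeds=[0, 1, 2])
```

Taking the median rather than the best run keeps a single favorable split from dominating the reported figure.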
3. Results :
As discussed in previous sections, pattern classification was carried out with two models, the
support vector machine and the hidden neural network.
a. Support Vector Machine:
The image data was preprocessed as described in previous sections and reduced in
dimensionality to 400. The training was done in two ways: a twofold and an n-fold approach.
1. Twofold approach:
Here the image data was randomly divided into two data sets; support vector training was
done with one data set and testing with the remaining one. Hence, 50% of the data was
used for training and 50% for testing.
2. n-fold approach:
Here the image data was randomly divided into n data sets, with n=10. n-1 data sets were used
for training and the remaining data set was used for testing. This process was repeated for the
n combinations of testing and training sets. The best model out of the n models was chosen as
the support vector machine.
Method    Accuracy (%)    Comments
2-fold    75.46           Lower accuracy, as the training sample is small
n-fold    79.18           Greater accuracy due to cyclic model selection and larger training size
Hence, the support vector machine model was able to classify the images as control, MCI and
AD with an accuracy of 79.18%, which is a good accuracy for a three-class problem.
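The n-fold partitioning described in this section can be sketched as a generator of train/test index lists. This is an illustration under assumptions: the name `k_fold_indices` and the seed are hypothetical, and the sketch covers only the data partitioning, not the SVM training or the selection of the best model.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With n=10 folds, each model trains on n-1 folds and is tested on the remaining one.
splits = list(k_fold_indices(416, k=10))
```

Setting k=2 with the same helper gives the twofold case, in which each half is used once for training and once for testing.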
b. Hidden Neural Network:
The hidden neural network was trained with:
a. 60% sample
b. 70% sample
The number of neurons was taken as:
a. 10
b. 15
c. 20
The training and testing were done for these combinations, keeping the validation sample at 15%
and the test sample at 25% and 15% respectively. The best results were obtained for 70% training
with 15 hidden neurons. To confirm the robustness of the results, the training was redone with
different random data selections; the accuracies obtained were 93.76%, 87.98% and 85.5% for
three random training and testing selections. The model with the lowest accuracy was chosen to
eliminate the bias in the data. This is because a good accuracy can be achieved simply by
choosing favorable data points for training, whereas what matters more is to get good accuracy
with any data point or image.
Graph showing error reduction with increased epochs, for training, Validation and Testing,
and the best one is chosen
Graph of gradient and validation check at last epoch
A complete confusion matrix, for the model,
class 1- Control, class 2- MCI, class 3- AD patient
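A three-class confusion matrix and the overall accuracy derived from it can be computed as sketched below. The function names `confusion_matrix` and `accuracy` are hypothetical, and the label lists are made-up placeholders used only to exercise the code; they are not the results of the study.

```python
def confusion_matrix(true_labels, predicted_labels, classes=(1, 2, 3)):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Placeholder labels: class 1 = control, class 2 = MCI, class 3 = AD.
true_labels = [1, 1, 2, 3, 3, 2]
predicted = [1, 2, 2, 3, 1, 2]
matrix = confusion_matrix(true_labels, predicted)
overall = accuracy(matrix)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy figure cannot reveal.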
4. Conclusion:
The method discussed above gives a very good accuracy of 85.5%. This means that we can
effectively diagnose AD and also track its progression, as the method classifies subjects into
the MCI category. Identifying MCI is very important in AD, as most MCI cases, about 80%, convert
to AD within a period of 5 years; hence precautionary steps can be taken for such patients[16].
This method can be further improved if a region-of-interest-based approach is taken, for example
using only the hippocampal area. However, such an approach requires a high-quality MRI image.
In this study, I used images with a resolution of 176x176, in which the hippocampus covers an
area of merely 10x10. The results would therefore not be good, as a clear boundary cannot be
drawn when segmenting such low-resolution images. If higher-resolution images are available,
this method could be tried[17].
5. References:
[1] Mayeux R. Early Alzheimer's disease. N Engl J Med. 2010 Jun 10;362(4):2194-2201
[2] P. Scheltens, “Early diagnosis of dementia: neuro-imaging,” J Neuro, vol. 246, pp. 16-20,
1999.
[3] Evanthia E. et al., "A supervised method to assist the diagnosis of Alzheimer’s Disease based
on functional Magnetic Resonance Imaging", Proceedings of the 29th Annual International
Conference of the IEEE EMBS, Cité Internationale, Lyon, France, pp. 3426-3428, August 23-26,
2007.
[4] http://en.wikipedia.org/wiki/Histogram_equalization, accessed on 25th of July, 2012
[5] Stephen M. Pizer, E. Philip Amburn, John D. Austin, Robert Cromartie, Ari Geselowitz,
Trey Greer, Bart ter Haar Romeny, John B. Zimmerman, Karel Zuiderveld, Adaptive histogram
equalization and its variations, Computer Vision, Graphics, and Image Processing, Volume 39,
Issue 3, September 1987, Pages 355-368
[6] A. Khireddine, K. Benmahammed, W. Puech, Digital image restoration by Wiener filter in
2D case, Advances in Engineering Software, Volume 38, Issue 7, July 2007, Pages 513-516,
ISSN 0965-9978
[7] Dao-Qiang Zhang, Song-Can Chen, A novel kernelized fuzzy C-means algorithm with
application in medical image segmentation, Artificial Intelligence in Medicine, Volume 32, Issue
1, September 2004, Pages 37-50, ISSN 0933-3657
[8] Keh-Shih Chuang, Hong-Long Tzeng, Sharon Chen, Jay Wu, Tzong-Jer Chen, Fuzzy c-means
clustering with spatial information for image segmentation, Computerized Medical
Imaging and Graphics, Volume 30, Issue 1, January 2006, Pages 9-15, ISSN 0895-6111
[9] Vapnik, V.N., 2000. The Nature of Statistical Learning Theory, Springer, NY
[10] Nello Cristianini, John Shawe-Taylor, 2000. An Introduction to Support Vector Machines
and other kernel-based learning methods, Cambridge University Press
[11] Lagrange Multipliers, Com S 477/577, Nov 18, 2008
[12] Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition,
Kluwer Academic Publishers, Boston
[13] http://nlp.stanford.edu/IR-book/html/htmledition/nonlinear-svms-1.html, accessed on May
23, 2012 at 3:34pm IST.
[14] A. Blum and P. Langley. Selection of relevant features and examples in machine learning.
Artificial Intelligence, 97(1-2):245–271, December 1997.
[15] Terence D. Sanger, Optimal unsupervised learning in a single-layer linear feedforward
neural network, Neural Networks, Volume 2, Issue 6, 1989, Pages 459-473, ISSN 0893-6080
[16] Pravat K. Mandal, Magnetic resonance spectroscopy (MRS) and its application in
Alzheimer's disease, Concepts in Magnetic Resonance Part, Volume 30A, Issue 1, pages 40–
64, January 200