Predictive Model for Blood-Brain Barrier


    International Journal of Advance Foundation and Research in Computer (IJAFRC)
    Volume 2, Issue 8, August 2015. ISSN 2348-4853


    1Dhanalakshmi N, 2Dr. Asha Gowda Karegowda, 3Radha N.
    1,2Siddaganga Institute of Technology, Tumkur; 3IRH Technologies Pvt. Ltd., Bengaluru
    [email protected], [email protected]

    ABSTRACT

    In drug discovery and development it is essential to determine whether a candidate molecule is capable of penetrating the Blood-Brain Barrier (BBB). Not all molecules can enter the brain; the barrier that restricts them is known as the blood-brain barrier. The blood-brain barrier does not allow all molecules into the brain; only molecules with a sufficiently high brain-to-blood concentration ratio are able to cross it. The objective of our work is to find which of the molecules penetrate into the brain. Computational work is carried out using the R tool on a dataset garnered from a forensic lab, Bengaluru. Among the five machine learning techniques, namely SVM (Support Vector Machine), Neural Network, Random Forest, Decision Tree and Multiple Linear Regression, the experimental results reveal that SVM gives better results than the other techniques for regression data, while Decision Tree generates the least error rate for classification data.

    Index Terms: Blood-Brain Barrier (BBB), Support Vector Machine, Decision Tree, Random Forest, Multiple Linear Regression, Logistic Regression, Neural Network.

    I. INTRODUCTION

    The blood-brain barrier (BBB) [1-10] is a highly selective permeability barrier that separates the circulating blood from the brain extracellular fluid (BECF) in the central nervous system (CNS). The blood-brain barrier is formed by brain endothelial cells, which are connected by tight junctions with an extremely high electrical resistivity of at least 0.1 Ω·m. The blood-brain barrier allows the passage of

    water, some gases, and lipid-soluble molecules by passive diffusion, as well as the selective transport of

    molecules such as glucose and amino acids that are crucial to neural function.

    Data mining has been used intensively and expansively by several organizations. In healthcare, data

    mining is becoming increasingly prevalent, if not increasingly necessary. Data mining applications can

    prominently benefit all parties involved in the healthcare industry. For example, data mining can help

    healthcare insurers detect fraud and abuse, healthcare organizations make customer relationship

    management decisions, physicians identify effective treatments and best practices, and patients receive

    improved and more affordable healthcare services. The huge amounts of data generated by healthcare transactions are too multifaceted and voluminous to be processed and analyzed by traditional methods. Data

    mining provides the methodology and technology to transform these banks of data into useful

    information for decision making [11,12, 21].

    Data mining on medical data [20] has great potential to improve the treatment quality of hospitals and increase the survival rate of patients. Medical data mining is one of the crucial issues in extracting valuable clinical knowledge from medical databases. Early prediction methods have become an apparent need in many clinical areas. Clinical studies have found early detection and intervention to be vital for averting clinical deterioration in patients at general hospitals [13]. The paper is organized as follows. Section 2 reviews the


    work related to BBB. Section 3 describes R tool and the methodologies adopted for the current work.

    Results are discussed in Section 4, followed by conclusions and future work in Section 5.

    II. RELATED WORK

    Scott Doniger et al. used 50 molecules, of which 25 are active molecules and the other 25 are inactive molecules, divided randomly into a training dataset and a test dataset. Two different algorithms were implemented, namely Neural Network and Support Vector Machine, and 30 validation sets were formed out of these 50 molecules. The results show that the support vector machine outperforms the neural network: the SVM predicted up to 96% of the molecules correctly, averaging 81.5%, whereas the neural network averaged 75.7% [4]. An Artificial Neural Network (ANN) model has been developed to predict the ratios of the steady-state concentrations of drugs in the brain to those in the blood (logBB) from their molecular structural parameters [9]. Claudia Suenderhauf et al. used a dataset consisting of 153 compounds; these molecules were compiled using more reliable in vivo BBB permeability-surface area (logPS) products, which are obtained by direct internal carotid artery perfusion. The open-source Chemistry Development Kit (CDK) was used to calculate physico-chemical properties and descriptors. The data was split into two classes, positively (CNSp+) and negatively (CNSp-) classified molecules, referring to compounds with logPS values above -2 and below -3, respectively. The decision tree induction (DTI) paradigm is an efficient and powerful method that can solve even linearly inseparable problems. Two widely used paradigms were used to induce decision trees: a decision tree built with the chi-squared automatic interaction detector (CHAID) on CDK descriptors, and a Classification and Regression Tree (CART) based on CDK descriptors [3]. Misha Denil et al. took a dataset containing 179 random molecules and modelled it using the random forest algorithm. Comparing the experimental values with theoretical values, it was found that the experimental values gave a better result than the theoretical values [18].

    III. METHODOLOGY USED

    A. R Programming

    R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. Users can access R through a command-line interpreter. R is free and open-source, and its capabilities can be extended through user-created packages, which provide specialized statistical techniques, graphical devices, import/export capabilities, reporting tools, etc. [13-14].
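    As a minimal preparatory sketch (the file name, column names and the 70/30 split below are assumptions, not details reported in the paper; the BBB+/BBB- labeling rule is the one described in Section IV), the packages used in this work can be installed and the descriptor dataset prepared as follows:

        # One-time installation of the packages used in this study
        install.packages(c("rpart", "randomForest", "nnet", "e1071"))

        library(rpart)          # decision trees
        library(randomForest)   # random forests
        library(nnet)           # feed-forward neural networks
        library(e1071)          # support vector machines

        # Hypothetical CSV: one row per compound, 13 descriptor columns plus experimental logBB
        bbb <- read.csv("bbb_descriptors.csv")

        # Label compounds: logBB >= 0 -> BBB+, otherwise BBB- (rule from Section IV)
        bbb$bbb_class <- factor(ifelse(bbb$logBB >= 0, "BBB+", "BBB-"))

        # Hold out part of the compounds for testing (70/30 split is an assumption)
        set.seed(42)
        idx       <- sample(nrow(bbb), size = round(0.7 * nrow(bbb)))
        train_set <- bbb[idx, ]
        test_set  <- bbb[-idx, ]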

    B. Machine Learning Techniques

    The following machine learning techniques have been experimented with using the R tool.

    Decision Tree - A Decision Tree (DT) represents a set of rules that follows a hierarchy of classes and values, used to classify instances. An instance is classified by first testing the attribute specified by the root and then following the branch corresponding to the value of that attribute in the instance; this process is then repeated for the sub-tree rooted at the new node [17, 18, 21]. The package rpart is to be included for Decision Tree in the R tool [13-14]; a brief usage sketch is given after the list of advantages below.

    Decision Tree has the following advantages:
    - It can be applied to any type of data.
    - The final structure of the classifier is quite simple and can be stored and handled in a graceful manner.


    - It handles conditional information very proficiently, subdividing the space into sub-spaces that are handled individually.
    - It is normally robust and insensitive to misclassification in the training set.
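    A minimal rpart sketch, assuming the bbb data frame and the train/test split from the earlier preparatory sketch (the column names bbb_class and logBB are hypothetical):

        # Classification tree on the 13 descriptors
        fit_dt <- rpart(bbb_class ~ . - logBB, data = train_set, method = "class")
        print(fit_dt)                                   # inspect the induced rules

        pred_dt <- predict(fit_dt, newdata = test_set, type = "class")
        mean(pred_dt != test_set$bbb_class)             # overall error rate

        # Regression tree that models logBB directly
        fit_dt_reg <- rpart(logBB ~ . - bbb_class, data = train_set, method = "anova")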

    Random Forest - The Random Forest (RF) algorithm is based on the features of decision trees, but instead of having only one tree there is a group of decision trees. The algorithm grows many trees in order to improve predictive accuracy. It classifies each case using every tree in the forest and selects a final predicted outcome by combining the results across all trees using a majority vote [18]. The package randomForest is to be included for Random Forest in the R tool [13-14]; a usage sketch follows the list of features below.

    Features of Random Forests include:
    - It is unexcelled in accuracy among current algorithms.
    - It runs efficiently on large databases.
    - It can handle thousands of input variables without variable deletion.
    - It gives estimates of which variables are important in the classification.
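    A minimal randomForest sketch under the same assumptions, using the settings reported in Section IV (100 trees, 2 variables tried at each split):

        # Random forest classifier; importance = TRUE also records variable importance
        fit_rf <- randomForest(bbb_class ~ . - logBB, data = train_set,
                               ntree = 100, mtry = 2, importance = TRUE)

        pred_rf <- predict(fit_rf, newdata = test_set)
        mean(pred_rf != test_set$bbb_class)   # overall error rate

        # Regression forest for logBB
        fit_rf_reg <- randomForest(logBB ~ . - bbb_class, data = train_set, ntree = 100)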

    Neural Network - Artificial Neural Networks (ANNs) have been used extensively in the field of healthcare. A neural network is a non-linear statistical data modeling technique used for classification tasks. It makes use of interconnected artificial neurons that process information through an iterative process in which the weights between neurons are successively corrected. Neural networks are highly sensitive to the data and generally have a reduced ability to extrapolate beyond the range of the input variables [19-21]. The package nnet is to be included for Neural Network in the R tool [13-14].
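    A minimal nnet sketch for the 13-3-1 architecture mentioned in Section IV (13 descriptor inputs, 3 hidden units, 1 output); the weight decay and iteration limit are assumptions:

        # Classification network; descriptors may need scaling in practice
        fit_nn <- nnet(bbb_class ~ . - logBB, data = train_set,
                       size = 3, decay = 1e-3, maxit = 500)

        pred_nn <- predict(fit_nn, newdata = test_set, type = "class")
        mean(pred_nn != test_set$bbb_class)   # overall error rate

        # Regression network with a linear output unit for logBB
        fit_nn_reg <- nnet(logBB ~ . - bbb_class, data = train_set,
                           size = 3, linout = TRUE, maxit = 500)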

    Support Vector Machine - A Support Vector Machine (SVM) searches for support vectors, which are observations found to lie at the edge of an area in space that forms a boundary between classes of observations. An SVM can also be used to classify data that is not linearly separable [21]. The package e1071 is to be included for Support Vector Machine in the R tool [13-14].
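    A minimal e1071 sketch under the same assumptions (the radial kernel is the package default, not a setting reported in the paper):

        # SVM classifier (e1071 wraps libsvm)
        fit_svm <- svm(bbb_class ~ . - logBB, data = train_set, kernel = "radial")

        pred_svm <- predict(fit_svm, newdata = test_set)
        mean(pred_svm != test_set$bbb_class)   # overall error rate

        # eps-regression variant for logBB
        fit_svm_reg <- svm(logBB ~ . - bbb_class, data = train_set)
        pred_reg    <- predict(fit_svm_reg, newdata = test_set)
        cor(pred_reg, test_set$logBB)^2        # R-squared as squared correlation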

    IV. RESULTS AND DISCUSSION

    A. Dataset

    The initial dataset contained 1665 molecular descriptors for each compound. Computing results from all 1665 descriptors is impractical, hence the Weka (Waikato Environment for Knowledge Analysis) tool was used for selecting significant features with the CfsSubsetEval module, followed by F-stepping (leave-one-out) evaluation. The CfsSubsetEval module selected 77 molecular descriptors based on the experimental logBB values, and these 77 descriptors were further reduced to 13. These 13 molecular descriptors are highly correlated with the logBB property; molecules characterised by them are the ones that penetrate into the brain. A dataset of 135 compounds, such as benzene, cyclopropane, aminopyrine, isoflurane, methane, propranolol, hydroxyzine, nitrous oxide, etc., was assembled and read into the R tool. Table 1 gives the description of the 13 molecular descriptors.

    The logBB value, i.e. the logarithm of the brain-to-blood distribution (concentration ratio), is computed for each compound. Compounds with experimental logBB >= 0 are labeled as BBB+ and those with logBB < 0 are labeled as BBB-.


    The dataset is used both for a regression model and for a classification model, where regression analysis mainly deals with continuous data and classification analysis mainly deals with discrete data. The performance measure used for the regression models [9-10] is R-squared, for the techniques Decision Tree, Random Forest, Neural Network, Support Vector Machine and Multiple Linear Regression.

    R-squared as the square of the correlation - The term "R-squared" is derived from this definition. R-

    squared is the square of the correlation between the model's predicted values and the actual values. This

    correlation can range from -1 to 1, and so the square of the correlation then ranges from 0 to 1. The

    greater the magnitude of the correlation between the predicted values and the actual values, the greater

    the R-squared, regardless of whether the correlation is positive or negative.
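    As a sketch of this computation (using R's built-in lm function for the Multiple Linear Regression model; column names follow the earlier assumptions), R-squared on held-out compounds is the squared correlation between predicted and observed logBB:

        # Multiple Linear Regression on the 13 descriptors
        fit_mlr <- lm(logBB ~ . - bbb_class, data = train_set)

        # R-squared on the test compounds
        pred_mlr <- predict(fit_mlr, newdata = test_set)
        cor(pred_mlr, test_set$logBB)^2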

    Figures 1 to 5 show the R-squared of the regression models for Decision Tree, Random Forest, Neural Network, Support Vector Machine and Multiple Linear Regression respectively. In Figures 1-5 the blue line indicates the best fit to the observed points and the black diagonal line represents perfect correlation. The R-squared values for the regression data using Decision Tree, Random Forest, Neural Network, Support Vector Machine and Multiple Linear Regression are found to be 0.4591, 0.7388, 0.7723, 0.8845 and 0.7676 respectively, as shown in Figures 1-5. For the regression model, among the five techniques SVM resulted in the best R-squared value.

    The performance measure used for the classification models is the overall error rate, for the techniques Decision Tree, Random Forest, Neural Network and Support Vector Machine. Figures 6-9 give the overall error rate for the classification models built with Decision Tree, Random Forest, Neural Network and Support Vector Machine respectively. The Decision Tree constructed for the classification data is shown in Figure 6; it shows that TPSA(NO), R1e+ and Mor04m are significant molecular descriptors. The left sub-plot of Figure 7 shows conditional variable importance, calculated by randomly permuting the values of a given predictor. Then, the difference in the

    model accuracy before and after the random permutations, averaged over all trees in the forest, tells us

    how important that predictor is for determining the outcome. For the right sub-plot in Figure 7, experiments were conducted using 100 trees, with 2 variables tried at each split. The final

    measure of importance is the total decrease in a decision tree node's impurity (the splitting criterion)

    when splitting on a variable. The splitting criterion used is the Gini index. This is measured for a variable

    over all trees giving a measure of the mean decrease in the Gini index of diversity relating to the variable.
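    A minimal sketch of how the permutation-based and Gini-based importance measures can be read off the fitted random forest from the earlier sketch (importance = TRUE makes both available):

        # Importance matrix: MeanDecreaseAccuracy (permutation) and MeanDecreaseGini columns
        imp <- importance(fit_rf)
        imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

        # Side-by-side importance plots, in the spirit of the sub-plots of Figure 7
        varImpPlot(fit_rf)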

    Based on this experiment, the left sub-plot of Figure 7 indicates that TPSA(NO), R1e+ and nOHp are the top 3 significant molecular descriptors, whereas the right sub-plot indicates that TPSA(NO), Mor04m and MATS5m are the top 3 significant molecular descriptors. The neural network, with a 13-3-1 architecture, is shown in Figure 8.

    Figure 9 shows the outcome of the Support Vector Machine for the classification data, where circles represent BBB+ training compounds, dark circles BBB+ test compounds, triangles BBB- training compounds and dark triangles BBB- test compounds; the BBB+ molecules are those that penetrate into the brain.

    Figure 10 gives the comparison between the models for regression data: Decision Tree, Random Forest, Neural Network, Support Vector Machine and Multiple Linear Regression respectively. SVM provides the best result for the regression data. Figure 11 gives the comparison between the classifiers for the classification data: Decision Tree, Random Forest, Neural Network and Support Vector Machine respectively. Decision Tree generates the least error rate for the classification data.


    Fig 1. Predicted Vs Observed logBB value using Decision Tree model for regression data

    Fig 2. Predicted Vs Observed logBB value using Random Forest model for regression data


    Fig 3. Predicted Vs Observed logBB value using Neural Network model for regression data

    Fig 4. Predicted Vs Observed logBB value using Support Vector Machine model for regression data


    Fig 5. Predicted Vs Observed logBB value using Multi Linear Regression model for regression data

    Fig 6. Decision Tree for classification data


    Fig 7. Random Forest model for classification data

    Fig 8. Neural Network model for classification data


    Fig 9. Outcome of Support Vector Machine model for classification data

    Fig 10. Comparison of classifiers for regression data


    Fig 11. Comparison of classifiers for classification data

    Table 1. Description of 13 molecular descriptors

    Sl No   Name        Description
    1       nO          Number of oxygen atoms
    2       BIC1        Bond Information Content index (neighborhood symmetry of 1-order)
    3       MATS5m      Moran autocorrelation of lag 5 weighted by mass
    4       MATS5v      Moran autocorrelation of lag 5 weighted by van der Waals volume
    5       Mor04m      Signal 04 / weighted by mass
    6       R1e+        R maximal autocorrelation of lag 1 / weighted by Sanderson electronegativity
    7       nArNR2      Number of tertiary amines (aromatic)
    8       nOHp        Number of primary alcohols
    9       C-028       R--CRX
    10      C-034       R--CR..X
    11      H-051       H attached to alpha-C
    12      O-057       Phenol / enol / carboxyl OH
    13      TPSA(NO)    Topological polar surface area using N,O polar contributions

    V. CONCLUSIONS AND FUTURE WORK

    The blood-brain barrier does not allow all molecules into the brain; only molecules with a sufficiently high brain-to-blood concentration ratio are able to cross it. Earlier studies used manually selected descriptors for prediction. The objective of our work is to find which of the molecules penetrate into the brain. The Weka tool has been used to find the 13 significant descriptors out of 1665 descriptors; these descriptors are highly correlated with the logBB property. Experiments have been conducted using


    Decision Tree, AdaBoost, Random Forest, SVM and Neural Networks for regression data and classification data. This work is of utmost importance for the pharmacy domain, to find which compounds penetrate into the brain based on the 13 significant descriptors. Experiments have been conducted with 137 compounds; in future we would like to work with 150 more compounds in addition to those used for the current work. As part of further work, the authors would also like to explore computational work using many more data mining techniques such as KNN, Naïve Bayes and Bayesian classifiers, and ensemble learning approaches such as stacking, voting, grading and bagging, among others. Furthermore, the authors would like to adopt various bio-inspired optimization techniques for significant feature selection, which would not only improve the performance of the classifiers but also reduce the computation time.

    VI. REFERENCES

    [1] Xingrong Liu, Meihua Tu, Rebecca S. Kelly, Cuiping Chen, Bill J. Smith, Development of a

    Computational Approach to Predict Blood-Brain Barrier Permeability, ASPET Journals (The

    American Society for Pharmacology and Experimental Therapeutics), Vol. 32(1), pp. 132-139, 2014.

    [2] William M Pardridge, Blood-brain barrier biology and methodology, Journal of NeuroVirology,

    Vol. 5, pp. 556-569, 1999.

    [3] Claudia Suenderhauf , Felix Hammann and Jorg Huwyler, Computational Prediction of Blood-

    Brain Barrier Permeability Using Decision Tree Induction, Molecules, Vol. 17, pp. 10429-10445, 2015.

    [4] Scott Doniger, Thomas Hofmann and Joanne Yeh, Predicting CNS Permeability of Drug Molecules: Comparison of Neural Network and Support Vector Machine Algorithms, Journal of Computational Biology, Vol. 9(6), pp. 849-864, 2002.

    [5] Pardridge, W. CNS drug design based on principles of blood-brain barrier transport. J.

    Neurochemistry, Vol. 70(5), pp. 1781-1792, 1998.

    [6] Norinder U and Haeberlein M , Computational approaches to the prediction of the blood-brain

    distribution, Adv Drug Deliv Rev, Vol. 54, pp. 291-313, 2002.

    [7] Platts JA, Abraham MH, Zhao YH, Hersey A, Ijaz L, and Butina D , Correlation and prediction of a

    large blood-brain distribution data set: an LFER study, Eur J Med Chem, Vol. 36, pp. 719-730, 2001.

    [8] Thomas Hofmann, Joanne Yeh, Predicting CNS using support vector machine algorithm, J. Comput.

    Biol, Vol. 10, pp. 549-558, 2002.

    [9] Prabha Garg and Jitender Verma, In Silico Prediction of Blood Brain Barrier Permeability: An

    Artificial Neural Network Model, J. Chem. Inf. Model, Vol.46, pp.289-297, 2006.

    [10] Keseru, G.M, A neural network based virtual high throughput screening test for the prediction of

    CNS activity, Comb. Chem. High Throughput Screen, Vol. 3, pp. 535-540, 2000.

    [11] Milley, A. Healthcare and data mining. Health Management Technology, Vol. 21(8), pp. 44-47,

    2000.


    [12] Yi Mao, Yixin Chen, Gregory Hackmann, Minmin Chen, Chenyang Lu, Marin Kollef, Thomas C. Bailey,

    Early Deterioration Warning for Hospitalized Patients by Mining Clinical Data, International

    Journal of Knowledge Discovery in Bioinformatics, Vol. 2(3), pp.1-20, 2011.

    [13] Yanchang Zhao, Yonghua Cen, Data Mining Applications with R, Academic Press, 2013.

    [14] Garrett Grolemund, Hands-On Programming with R: Write Your Own Functions and Simulations,

    Shroff/O'Reilly publications, 2014.

    [15] Freese, Jeremy and J. Scott Long, Regression Models for Categorical Dependent Variables Using Stata, Stata Press, 2006.

    [16] Long, J. Scott. Regression Models for Categorical and Limited Dependent Variables, Sage

    Publications, 1997.

    [17] Quinlan, J. R, Induction of decision trees, Machine Learning, Vol. 1(1), pp. 81-106, 1986.

    [18] Breiman, L, Misha, Random forests, Machine Learning, Vol. 45(1), pp.34-39, 2001.

    [19] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Press, New York, 1994.

    [20] Siri Krishan Wasan, Vasudha Bhatnagar and Harleen Kaur, The Impact of Data Mining

    Techniques on Medical Diagnostics, Data Science Journal, Vol. 5, 2006.

    [21] J. Han, and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, Morgan Kauffmann

    Publishers, 2001.