High-Dimensional Random Forests
University of Texas at El Paso
ScholarWorks@UTEP
Open Access Theses & Dissertations
2021-05-01
High-Dimensional Random Forests
Roland Fiagbe University of Texas at El Paso
Follow this and additional works at: https://scholarworks.utep.edu/open_etd
Part of the Statistics and Probability Commons
Recommended Citation
Fiagbe, Roland, "High-Dimensional Random Forests" (2021). Open Access Theses & Dissertations. 3252. https://scholarworks.utep.edu/open_etd/3252
This is brought to you for free and open access by ScholarWorks@UTEP. It has been accepted for inclusion in Open Access Theses & Dissertations by an authorized administrator of ScholarWorks@UTEP. For more information, please contact [email protected].
HIGH-DIMENSIONAL RANDOM FORESTS
ROLAND FIAGBE
Master’s Program in Statistics
APPROVED:
Xiaogang Su, Ph.D., Chair
Wen-Yee Lee, Ph.D.
Naijun Sha, Ph.D.
Stephen L. Crites Jr., Ph.D.
Dean of the Graduate School
© Copyright
Roland Fiagbe
2021
Dedicated to my
Late FATHER John K. Fiagbe, MOTHER Janet Karikari, SIBLINGS Marcel and Adwoa, and my ADVISOR Professor Xiaogang Su
with love
HIGH-DIMENSIONAL RANDOM FORESTS
by
ROLAND FIAGBE
THESIS
Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE
Department of Mathematical Sciences
THE UNIVERSITY OF TEXAS AT EL PASO
May 2021
Acknowledgements
First and foremost, it is a genuine pleasure to express my heartfelt gratitude to my research supervisor, Professor Xiaogang Su of the Mathematical Sciences Department at the University of Texas at El Paso, for his constant guidance, support, encouragement and enduring patience. In spite of his busy schedule, he was always readily available to provide clear explanations of concepts whenever I did not seem to make headway. Without his dedication and persistent help, this thesis would not have been possible.
Also, I would like to thank my committee members, Professor Naijun Sha of the Mathematical Sciences Department and Professor Wen-Yee Lee of the Chemistry Department, both at the University of Texas at El Paso, for their comments, suggestions, additional support and guidance, which were very valuable to the completion of this work.
Last but not least, I am extremely thankful to my family, colleagues and friends for their support and inspiration, and to everyone who played a role in my academic accomplishments.
Abstract
The significant advances in technology have enabled easy collection and management of high-dimensional data in many fields; however, modeling these data poses a major challenge in the field of data science. Dealing with high-dimensional data is one of the significant challenges that degrade the performance and precision of most classification and regression algorithms, e.g., random forests. Random Forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance and precision, like those of other methods, are highly affected by high dimensions, especially when the dataset contains a huge number of noise or noninformative features. It is known in the literature that, for data dominated by uninformative features, the expected number of informative variables among those sampled at each split is small, which makes it difficult to obtain an accurate or robust random forest model.
In this study, we present a new algorithm that incorporates ridge regression as a variable screening tool to discern informative features in the setting of high dimensions, and then applies the classical random forest to a top portion of the selected important features. Simulation studies in high dimensions are carried out to test how the proposed method addresses the above problem and improves the performance of random forest models. To illustrate our method, we applied it to a real-life dataset (the Communities and Crime Dataset), sourced from the UCI database. The results show that variable screening with ridge regression can be a very useful tool for building high-dimensional random forests.
Table of Contents
Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problems in High-Dimensional Random Forests . . . . . . . . . . . . . . . 3
1.2 Overview of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Review of Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 History of Random Forests (RF) . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Definition of Random Forest Model . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Characterizing the Accuracy of Random Forests . . . . . . . . . . . . . . . 8
2.3.1 Convergence of Random Forests . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Strength and Correlation of Random Forests . . . . . . . . . . . . . 8
2.4 Out-of-Bag Sample and Out-of-Bag Error . . . . . . . . . . . . . . . . . . . 11
2.5 Mean Squared Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Problems in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7.1 Simulation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 High-Dimensional Random Forests with Ridge Regression Based Variable Screening 23
3.1 Ridge Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Ridge Penalty (λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Variable Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Simulation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Comparing Performance of the Proposed Model to the Existing Model . . . 30
4.2.1 Simulation Results from Linear Model . . . . . . . . . . . . . . . . 32
4.2.2 Simulation Results from Tree Model . . . . . . . . . . . . . . . . . 35
5 Real Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 Communities and Crime Dataset . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.1 Presentation of Results from Real Data Application . . . . . . . . . 40
6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Appendix
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
List of Tables
4.1 Average Prediction MSE Values for linear model 2.20 with ρ = 0 . . . . . . 32
4.2 Average Prediction MSE Values for linear model 2.20 with ρ = 0.5 . . . . . 33
4.3 Average Prediction MSE Values for tree model 2.22 with ρ = 0 . . . . . . . 35
4.4 Average Prediction MSE Values for tree model 2.22 with ρ = 0.5 . . . . . . 36
5.1 Results of analysis of Communities and Crime Dataset. Comparing the
proposed method to the existing (naive) random forest method in dealing
with high dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Prediction MSE values of training and test set for the proposed method and
existing (naive) random forest method. . . . . . . . . . . . . . . . . . . . . 40
List of Figures
2.1 Prediction MSE for linear model when ρ = 0 . . . . . . . . . . . . . . . . . 17
2.2 Prediction MSE for linear model when ρ = 0.5 . . . . . . . . . . . . . . . . 17
2.3 Prediction MSE for MARS model when ρ = 0 . . . . . . . . . . . . . . . . 18
2.4 Prediction MSE for MARS model when ρ = 0.5 . . . . . . . . . . . . . . . 18
2.5 Prediction MSE for tree model when ρ = 0 . . . . . . . . . . . . . . . . . . 19
2.6 Prediction MSE for tree model when ρ = 0.5 . . . . . . . . . . . . . . . . . 19
4.1 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.7 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.8 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Out-of-Bag Estimate Errors of both models . . . . . . . . . . . . . . . . . 41
5.2 Comparing Prediction MSE for Training and Test Sets . . . . . . . . . . . 41
Chapter 1
Introduction
One of the current significant challenges in predictive modelling is high-dimensional data. Such data are usually found in areas like ecology, medicine, and biology, where DNA microarray technology produces a large number of measurements, as well as in finance and in industries that generate text data. Two special characteristics of high-dimensional data contribute significantly to the low performance of classification and regression algorithms. One characteristic is that the dataset usually contains different classes of variables that do not relate to the response variable; for instance, in text data, some features would relate to sports while others would relate to music. The other characteristic is that, with high-dimensional data, there is a high chance that a large number of variables are uninformative to the class feature. Uninformative variables are weakly correlated with the class feature and therefore have low power in predicting the response feature.
Random Forest was first introduced by Breiman (2001) and is known as a nonparametric statistical learning method from the family of ensemble decision tree approaches. Random Forest has been described in recent studies as one of the state-of-the-art machine learning methods for prediction and classification, and it has been very successful at these tasks (Capitaine et al., 2020). In short, it refers to a technique that builds numerous decision trees. In competition with modern and advanced methods such as Support Vector Machines (SVM) and boosting, random forest models have emerged as fast and easy to implement, with highly accurate predictions, and able to handle a very large number of variables without overfitting (Biau, 2012). Random Forest employs bagging and randomness when building each tree, which results in an uncorrelated forest of trees whose aggregate prediction is more accurate than that of any individual tree.
Based on bagging and random feature selection, a number of decision trees are generated, and their predictions are combined by majority voting for classification (Kulkarni and Sinha, 2012) or by averaging the predictions of the tree models for regression (Denil et al., 2014). Each tree in the forest depends on the values of data randomly sampled with replacement, known as a bootstrap sample. For each tree grown on a bootstrap sample, the error rate on the observations left out of that bootstrap sample is monitored; this is called the out-of-bag (OOB) error rate. Approximately one-third of the training observations are left out of each bootstrap sample (the out-of-bag sample) and are used to compute an unbiased estimate of the classification error as more trees are added to the forest. The algorithm therefore needs no further cross-validation to estimate the test or generalization error (Zhang and Zulkernine, 2006).
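The roughly one-third out-of-bag share can be checked empirically. The following sketch (an illustration of ours, not code from the thesis) draws one bootstrap sample and counts the rows it leaves out:

```python
import random

def oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and
    return the fraction of the n rows that were never selected."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}  # distinct sampled rows
    return 1 - len(in_bag) / n

# For large n, the out-of-bag share settles near 1 - 1/e, about 0.368.
print(round(oob_fraction(100_000), 3))
```

The limiting value 1 − 1/e is derived formally in Section 2.4.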
Since its introduction, random forests, initially known as a single algorithm, have developed into an entire framework of models (Denil et al., 2014) and have seen significant applications in various fields of science, technology and the humanities. An early precursor of Random Forest, known as bagging, was developed by Breiman (1996a). In that initial approach, a random selection without replacement is made from the training set to grow each tree. Another approach is the random split selection developed by Dietterich (1998), where at each node the split is selected at random from among the permissible splits. Breiman's initial bagging approach (Breiman, 1996a) was effective in reducing the variance of regression predictors, but its major limitations were leaving the bias unchanged and high correlation among trees. These concerns motivated the development of Random Forest. According to Xu et al. (2012), among the numerous methods proposed to build random forest models from subspaces, the method proposed by Breiman (2001) has been the mainstream and the most applied, owing to its higher performance compared to other methods.
Although the latest and significant developments in technology enable the collection and management of high-dimensional data in many fields, analyzing and handling these data remains problematic in the field of data science. Performing analysis and prediction with high-dimensional data has always been challenging owing to the large number of predictor variables relative to the sample size. Random Forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance, like that of other methods, is affected by high dimension, especially when the dataset contains a huge number of noise predictors and complex correlations among variables. Moreover, the presence of correlated variables in a high-dimensional dataset affects the ability of the random forest model to find the true predictors, which decreases their estimated importance scores (Gregorutti et al., 2017). Recently, the complexities and analytical challenges associated with high-dimensional data in random forest models have been tackled by numerous researchers. In this project, we develop a novel feature screening method for fitting a random forest model: we propose using ridge regression as a variable screening method and applying the classical random forest to a top portion of the important variables.
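The screening step of the proposed pipeline amounts to fitting a ridge regression, ranking the features by the magnitude of their coefficients, and keeping a top portion for the forest. The fragment below is a minimal, library-free sketch of that idea on toy data of our own choosing (the function names, the data, and the penalty λ = 1 are illustrative assumptions, not the thesis's implementation; Chapter 3 develops the actual method):

```python
def ridge_coefficients(X, y, lam):
    """Solve the ridge normal equations (X'X + lam*I) beta = X'y
    by Gaussian elimination with partial pivoting."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for c in range(p):                       # forward elimination
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for j in range(c, p):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    beta = [0.0] * p                         # back substitution
    for c in reversed(range(p)):
        beta[c] = (b[c] - sum(A[c][j] * beta[j]
                              for j in range(c + 1, p))) / A[c][c]
    return beta

def screen_top(X, y, lam, keep):
    """Rank features by |ridge coefficient| and return the top `keep` indices."""
    beta = ridge_coefficients(X, y, lam)
    return sorted(range(len(beta)), key=lambda j: -abs(beta[j]))[:keep]

# Toy data: y depends on feature 0 only; features 1 and 2 are noise.
X = [[1.0, 0.3, -0.2], [2.0, -0.1, 0.4], [3.0, 0.2, 0.1],
     [4.0, -0.3, -0.4], [5.0, 0.1, 0.3]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
print(screen_top(X, y, lam=1.0, keep=1))  # feature 0 should top the ranking
```

A working version would standardize the predictors and tune λ; here the point is only the ranking-by-coefficient mechanics of the screening step.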
1.1 Problems in High-Dimensional Random Forests
From past studies, random forest is known not to perform at its best on certain data characteristics, particularly very high-dimensional data. Breiman's method works well on data of moderate dimensions, but its performance drops when dealing with very high-dimensional data (Xu et al., 2012). One major step in building a random forest model is the random selection of features when splitting the data (bootstrap samples) to grow the classifiers (decision trees). However, this step is not well suited to high-dimensional data, because such data usually contain many variables that are uninformative about the target variable. In a general random forest model, there is no criterion for selecting informative variables: all variables are sampled by uniform random sampling. As a result, the random selection of splitting variables at each split may contain few or no informative variables, especially when the percentage of truly informative variables is small. Uninformative variables (also termed noise variables) are weakly correlated with the target variable and therefore have low predictive power with respect to it. Consequently, weak trees are built in the forest, and the classification performance of the random forest model declines significantly (Amaratunga et al., 2008). Xu et al. (2012) stated that the subspace size used to build each tree has to be enlarged to increase the chance of selecting truly informative features. This would produce less noisy subspaces and improve the strength of each tree in the forest. However, this approach involves numerous computations and is likely to increase the correlation among the resulting trees in the forest. Highly correlated trees also significantly reduce the performance of random forest models (Breiman, 2001).
1.2 Overview of Study
The overview of this project is as follows. Chapter 2 highlights the history of the random forest model, some theoretical background, and related works on high-dimensional random forests. In Chapter 3, we provide detailed theoretical work on the proposed method, namely applying the ridge estimator as a variable screening tool and fitting a general random forest model to a top portion of the important features. Chapter 4 presents a simulation study illustrating the problems associated with high-dimensional data and a simulation study of the proposed method. The proposed method is applied to a real-life dataset and the results are presented in Chapter 5. Finally, Chapter 6 presents some discussion, conclusions and further studies.
Chapter 2
Review of Literature
This chapter briefly introduces the history of random forests, along with methods, justifications and some theoretical background. The difficulty encountered by RF when facing high-dimensional data is then elaborated, and several approaches available in the literature are presented.
2.1 History of Random Forests (RF)
Random Forest (RF) is considered a refinement of bagged trees. Its name derives from the random decision forests initially proposed by Ho (1995). Breiman's (1996a) bagging idea and Ho's (1995) random subspace method were combined to construct a collection of decision trees with controlled variation. The work of Amit and Geman (1997) greatly influenced Breiman's notion in his early development of random forests. Amit and Geman (1997) first presented the idea of growing an individual tree by searching over a random subset of the accessible decisions at the node-splitting stage. Dietterich (2000) later introduced the idea of randomized node optimization, proposing that the decision at each node be selected by a randomization process. Putting these ideas together, the method of building a forest of trees was first established by Breiman (2001). In order to create variation among the trees, the training data are projected into a randomly sampled subspace (known as a bootstrap sample) before growing each tree. His paper introduced a procedure for building a random forest of uncorrelated trees using Classification and Regression Trees (CART), randomized node optimization and bagging. Moreover, he introduced an estimate of the generalization error of a forest using out-of-bag errors, as well as several useful concomitant features including the proximity matrix, partial dependence plots, and variable importance ranking.
2.2 Definition of Random Forest Model
Random Forest is defined as a classifier made up of a collection of tree-structured classifiers $\{ r(X, \Phi_k, D_n),\ k \geq 1 \}$, where:
• $r(\cdot)$ is the regression or classification function estimate,
• $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is the training sample of i.i.d. $[0,1]^d \times \mathbb{R}$-valued random variables ($d \geq 2$), where $[0,1]^d$ is equipped with the standard Euclidean metric,
• $\Phi_1, \Phi_2, \ldots, \Phi_k$ are independent and identically distributed (i.i.d.) random vectors, and
• $X$ is an input vector, conditional on $\Phi_k$ and $D_n$.
It is a supervised learning algorithm that utilizes an ensemble technique for both classification and regression (Berdugo et al., 2020). For classification, each classifier (tree) casts a vote for the most popular class, whereas in regression the predictions of the random trees are combined to form the regression estimate, denoted as
$$ \bar{r}(X) = E_\Phi\left[ r(X, \Phi_k, D_n) \right] \qquad (2.1) $$
where $E_\Phi$ denotes the expectation with respect to the random parameter $\Phi$, conditionally on $X$ and the training sample $D_n$.
The individual random trees are constructed by the following procedure. According to Biau (2012), all tree nodes are associated with rectangular cells, such that at each step of the tree construction the collection of cells associated with the leaves of the tree forms a partition of $[0,1]^d$; the root of the tree is $[0,1]^d$ itself. At each node of the tree construction, a coordinate of $X = (X^{(1)}, X^{(2)}, \ldots, X^{(J)})$ is selected, with the probability of the $j$th feature being selected given as $p_{nj} \in (0,1)$. Once the coordinate is selected, the split is located at the midpoint of the chosen side. This procedure is repeated $[\log_2 \lambda_n]$ times, where $\lambda_n \geq 2$ is a deterministic parameter depending on $n$ that is fixed beforehand by the user. Each randomized classifier (tree) $r_n(X, \Phi)$ outputs the average of all $Y_i$ for which the corresponding vectors $X_i$ fall in the same cell of the random partition as $X$.
This is given as
$$ r_n(X, \Phi) = \frac{\sum_{i=1}^{n} Y_i \, \mathbb{1}_{[X_i \in A_n(X,\Phi)]}}{\sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]}} \, \mathbb{1}_{E_n(X,\Phi)} \qquad (2.2) $$
where $A_n(X, \Phi)$ represents the rectangular cell of the random partition containing $X$, and the event $E_n(X, \Phi)$ is defined by
$$ E_n(X, \Phi) = \left[ \sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]} \neq 0 \right] \qquad (2.3) $$
Now, to compute the random forest estimate, we take the expectation of the classifiers $r_n(X, \Phi)$ with respect to the parameter $\Phi$. Therefore, the random forest regression estimate is given as
$$ r_n(X) = E_\Phi\left[ r_n(X, \Phi) \right] = E_\Phi\left[ \frac{\sum_{i=1}^{n} Y_i \, \mathbb{1}_{[X_i \in A_n(X,\Phi)]}}{\sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]}} \, \mathbb{1}_{E_n(X,\Phi)} \right] \qquad (2.4) $$
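In practice the expectation in (2.4) is approximated by averaging a finite number of randomized trees. The toy sketch below (our illustration, with one-dimensional data and a single random split point standing in for $\Phi$) evaluates the cell average of equation (2.2) for each randomized partition and then averages across trees:

```python
import random

def tree_estimate(x, data, split):
    """Eq. (2.2) for a one-split 'tree' on [0,1]: average the Y_i whose X_i
    fall on the same side of `split` as x; return 0 if the cell is empty."""
    cell = [yi for xi, yi in data if (xi < split) == (x < split)]
    return sum(cell) / len(cell) if cell else 0.0

def forest_estimate(x, data, K, seed=0):
    """Approximate eq. (2.4): average K randomized tree estimates,
    each built from an independently drawn random split point."""
    rng = random.Random(seed)
    return sum(tree_estimate(x, data, rng.random()) for _ in range(K)) / K

# Toy sample: low responses near x = 0.1-0.2, high responses near x = 0.8-0.9.
data = [(0.1, 1.0), (0.2, 1.0), (0.8, 3.0), (0.9, 3.0)]
est = forest_estimate(0.15, data, K=200)
print(round(est, 2))  # between the local mean (1.0) and the overall mean (2.0)
```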
Turning to some general remarks about the random forest model: at the tree-construction stage, each tree has $2^{[\log_2 \lambda_n]} \approx \lambda_n$ terminal nodes, and each leaf cell has volume $2^{-[\log_2 \lambda_n]} \approx 1/\lambda_n$. At each node of each tree, a candidate variable $X^{(j)}$ is chosen with probability $p_{nj} \in (0,1)$; since exactly one of the $J$ coordinates is selected at each node, it follows that $\sum_{j=1}^{J} p_{nj} = 1$.
2.3 Characterizing the Accuracy of Random Forests
2.3.1 Convergence of Random Forests
Breiman (2001) defined the margin function of a random forest from the ensemble of classifiers $r_1(X, \Phi), r_2(X, \Phi), \ldots, r_K(X, \Phi)$ and a training set drawn at random from the distribution of the random vector $(X, Y)$. The margin function is denoted as
$$ mg(X, Y) = \mathrm{av}_k\, I(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} \mathrm{av}_k\, I(r_k(X, \Phi) = \tau), \quad k = 1, \ldots, K \qquad (2.5) $$
where $I(\cdot)$ is an indicator function. The margin function estimates the degree to which the average number of votes at $(X, Y)$ for the correct class $Y$ surpasses the average vote in favor of any other class $\tau$. The larger the margin, the more accurate the forest.
We now define the generalization error as
$$ PE^* = P_{X,Y}\left( mg(X, Y) < 0 \right) \qquad (2.6) $$
where the probability is taken over the $(X, Y)$ space.
Theorem 2.3.1. Following the Strong Law of Large Numbers (SLLN), as the number of trees (classifiers) increases, for almost surely all sequences $\Phi_1, \Phi_2, \ldots, \Phi_k$, $PE^*$ converges to
$$ P_{X,Y}\left( P_\Phi(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) < 0 \right) \qquad (2.7) $$
This theorem demonstrates why random forest models do not overfit as the number of trees increases, but instead yield a limiting value of the generalization error $PE^*$.
2.3.2 Strength and Correlation of Random Forests
In random forests, the two most important properties are the strength and the correlation. These two parameters measure, respectively, the accuracy of the individual classifiers and the dependence between them. Breiman (2001) introduced these two properties to calculate an upper bound on the generalization error (Bernard et al., 2010).
The margin function for a random forest is given as
$$ mr(X, Y) = P_\Phi(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) \qquad (2.8) $$
and the strength of the set of classifiers is given as
$$ s = E_{X,Y}\, mr(X, Y) \qquad (2.9) $$
Assuming $s \geq 0$, Chebyshev's inequality gives
$$ PE^* \leq \frac{\mathrm{var}(mr)}{s^2} \qquad (2.10) $$
The variance of $mr$ is derived as follows. Let
$$ \tau(X, Y) = \arg\max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) $$
so that
$$ mr(X, Y) = P_\Phi(r_k(X, \Phi) = Y) - P_\Phi(r_k(X, \Phi) = \tau(X, Y)) = E_\Phi\left[ I(r_k(X, \Phi) = Y) - I(r_k(X, \Phi) = \tau(X, Y)) \right] $$
The raw margin function is
$$ rmg(\Phi, X, Y) = I(r_k(X, \Phi) = Y) - I(r_k(X, \Phi) = \tau(X, Y)) $$
Hence, $mr(X, Y)$ is the expectation of $rmg(\Phi, X, Y)$ with respect to $\Phi$, which implies that
$$ mr(X, Y)^2 = E_{\Phi,\Phi'}\, rmg(\Phi, X, Y)\, rmg(\Phi', X, Y) \qquad (2.11) $$
Now,
$$ \mathrm{var}(mr) = E_{\Phi,\Phi'}\left( \mathrm{cov}_{X,Y}\, rmg(\Phi, X, Y)\, rmg(\Phi', X, Y) \right) = E_{\Phi,\Phi'}\left( \rho(\Phi, \Phi')\, sd(\Phi)\, sd(\Phi') \right) $$
where $\rho(\Phi, \Phi')$ is the correlation between $rmg(\Phi, X, Y)$ and $rmg(\Phi', X, Y)$, and $sd(\Phi)$ is the standard deviation of $rmg(\Phi, X, Y)$. Let $\bar{\rho}$ represent the mean value of the correlation:
$$ \bar{\rho} = \frac{E_{\Phi,\Phi'}\left( \rho(\Phi, \Phi')\, sd(\Phi)\, sd(\Phi') \right)}{E_{\Phi,\Phi'}\left( sd(\Phi)\, sd(\Phi') \right)} $$
Therefore,
$$ \mathrm{var}(mr) = \bar{\rho}\left( E_\Phi\, sd(\Phi) \right)^2 \leq \bar{\rho}\, E_\Phi\, \mathrm{var}(\Phi) \qquad (2.12) $$
$$ E_\Phi\, \mathrm{var}(\Phi) \leq E_\Phi\left( E_{X,Y}\, rmg(\Phi, X, Y) \right)^2 - s^2 \leq 1 - s^2 \qquad (2.13) $$
Theorem 2.3.2. An upper bound on the generalization error follows from the strength $s$ of the individual classifiers and the mean correlation $\bar{\rho}$ between the classifiers with respect to the margin functions:
$$ PE^* \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} \qquad (2.14) $$
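As a numerical illustration of the bound in (2.14), with hypothetical values $s = 0.6$ and $\bar{\rho} = 0.2$ (ours, not from the thesis), the generalization error can be no worse than about 0.356; note that halving the mean correlation halves the bound, while the strength enters quadratically:

```python
def pe_star_bound(rho_bar, s):
    """Upper bound on the generalization error, eq. (2.14)."""
    return rho_bar * (1 - s ** 2) / s ** 2

print(pe_star_bound(0.2, 0.6))  # about 0.356
print(pe_star_bound(0.1, 0.6))  # halving rho_bar halves the bound
```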
In growing each tree, randomly selected inputs or combinations of inputs are used at
each node. However, in order to improve the accuracy of the forest, the randomness infused
has to minimize the correlation (ρ) among trees while maintaining the strength of the trees
or classifiers. The resulting forests would have the following desirable characteristics:
• Its precision is about as good as, and sometimes better than, that of AdaBoost.
• It is moderately robust to anomalies and noise.
• It is more computationally efficient than bagging or boosting.
• It produces applicable estimates of error, the strength of classifiers, correlation, and
variable importance.
2.4 Out-of-Bag Sample and Out-of-Bag Error
The idea of using out-of-bag estimates to estimate the generalization error was proposed by Wolpert and Macready (1999) and Tibshirani (1996). Wolpert and Macready (1999) proposed procedures for estimating a bagged predictor's generalization error in regression-type problems, whereas Tibshirani (1996) proposed estimating the generalization error of random classifiers using out-of-bag estimates of variance. Moreover, the work of Breiman (1996b) on error estimation for bagged classifiers provides empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. In the process of growing trees in the forest, a sequence of bootstrap samples $T_{B,1}, T_{B,2}, \ldots, T_{B,K}$ of the same size is randomly generated from the training set $T$. Thereafter, $K$ trees are constructed such that the $k$th tree $\varphi_k(x, T_{B,k})$ depends on the $k$th bootstrap sample.
Assume there are $N$ rows in the training set. Under random selection, the probability that a given row is not chosen in a single draw is $\frac{N-1}{N}$. Since bootstrap sampling draws $N$ rows with replacement, the probability that a given row is never chosen is $\left( \frac{N-1}{N} \right)^N$. As $N$ becomes large ($N \to \infty$),
$$ \lim_{N \to \infty} \left( \frac{N-1}{N} \right)^N = \lim_{N \to \infty} \left( 1 - \frac{1}{N} \right)^N = e^{-1} \approx 0.368 \approx 37\% $$
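The convergence to $e^{-1}$ is fast, as a quick numerical check (our illustration) confirms:

```python
import math

# (1 - 1/N)^N approaches e^{-1} (about 0.3679) as N grows.
for N in (10, 100, 1000):
    print(N, round((1 - 1 / N) ** N, 4))
```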
Therefore, approximately 37% of each bootstrap training set $T_B$ (known as the out-of-bag sample) is left out, and these observations are used to form accurate estimates of important quantities, such as improved estimates of node probabilities and of the generalization error. This generalization error estimate is referred to as the out-of-bag error. In conclusion, the out-of-bag error estimate eliminates the requirement of setting aside a test set for cross-validation.
2.5 Mean Squared Error (MSE)
As illustrated above, owing to the use of random sampling in constructing bootstrap samples, approximately 37% of the observations are not used by any individual classifier (tree); these form the out-of-bag (OOB) sample for each classifier. The OOB samples are used to estimate the prediction accuracy of the forest, computed as
$$ MSE_{OOB} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - \bar{y}_{j,OOB} \right)^2 \qquad (2.15) $$
where $n$ is the total number of observations and $\bar{y}_{j,OOB}$ is the average of the out-of-bag predictions for the $j$th observation. Moreover, the percentage of variance explained, $R^2_{OOB}$, can be obtained as
$$ R^2_{OOB} = 1 - \frac{MSE_{OOB}}{SST} \qquad (2.16) $$
where $SST = \sum_{j=1}^{n} (y_j - \bar{y})^2$ is the total sum of squares and $\bar{y}$ denotes the overall mean of the observations.
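Given the vector of per-observation OOB averages $\bar{y}_{j,OOB}$, equations (2.15) and (2.16) are short computations; the sketch below makes the bookkeeping explicit on toy numbers of our own:

```python
def mse_oob(y, y_oob):
    """Eq. (2.15): mean squared error of the out-of-bag predictions."""
    return sum((yj - pj) ** 2 for yj, pj in zip(y, y_oob)) / len(y)

def r2_oob(y, y_oob):
    """Eq. (2.16): 1 - MSE_OOB / SST, with SST the total sum of squares."""
    ybar = sum(y) / len(y)
    sst = sum((yj - ybar) ** 2 for yj in y)
    return 1 - mse_oob(y, y_oob) / sst

y     = [1.0, 2.0, 3.0, 4.0]   # observed responses (toy values)
y_oob = [1.2, 1.8, 3.1, 4.3]   # averaged OOB predictions per observation
print(round(mse_oob(y, y_oob), 3))  # 0.045
```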
For tree $t$, the out-of-bag mean squared error is computed as
$$ MSE_{t,OOB} = \frac{1}{n_{t,OOB}} \sum_{j=1}^{n_{t,OOB}} \left( y_j - \bar{y}_{tj} \right)^2 \qquad (2.17) $$
where $n_{t,OOB}$ is the number of observations in the OOB sample of tree $t$ and $\bar{y}_{tj}$ is the average of the OOB predictions for the $j$th observation of tree $t$. Now, for $R$ runs, the average mean squared error per run can be computed as
$$ AMSE_{t,OOB} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{n_{t,OOB}} \sum_{j=1}^{n_{t,OOB}} \left( y_j - \bar{y}_{tj} \right)^2 \qquad (2.18) $$
where $R$ is the total number of runs (iterations).
2.6 Variable Importance
One of the most important objectives of machine learning, aside from finding the best and most accurate model of the response variable, is to identify the predictors that are most important and relevant for making predictions. This enables an in-depth understanding of the problem under study. Random forest models provide several methods for assessing the significance of the predictors that best explain a response variable, which subsequently improves the interpretability of the model (Louppe, 2014).
In random forest models, as well as other tree-based models, a naive measure of variable importance is simply to count the number of times each predictor is selected by the individual trees in the forest (Strobl et al., 2007). Friedman (2001) presented a more elaborate variable importance measure that incorporates a (weighted) mean of each individual tree's improvement from the splits produced by each variable. An example is the Gini importance, available in random forest implementations for measuring variable importance in classification, with average impurity reduction playing the analogous role for regression trees. Breiman (2001, 2002) proposed determining the importance of a variable $X_j$ for predicting a response variable $Y$ by adding up the weighted impurity decreases $p(t)\,\Delta i(s_t, t)$ over all nodes $t$ where $X_j$ is used, averaged over all $N_T$ trees in the forest
(Louppe et al., 2013). This is given as
$$ \mathrm{Imp}(X_j) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_j} p(t)\, \Delta i(s_t, t) \qquad (2.19) $$
where $v(s_t)$ denotes the variable used to split node $t$, and $p(t) = N_t / N$ is the proportion of samples reaching $t$. This measure is referred to as the Mean Decrease Gini or Gini importance, and more generally as the Mean Decrease Impurity (MDI) importance.
A popular variable importance measure for regression forests is the average impurity reduction. However, the permutation accuracy importance is the most advanced measure for selecting important variables in random forests. The idea of measuring the Mean Decrease Accuracy (MDA), or Mean Increase Error, of the forest to evaluate the importance of a variable X_j was also proposed by Breiman (2001, 2002). This measure is referred to as permutation importance since the values of the variable X_j are randomly permuted in the out-of-bag samples (Louppe et al., 2013).
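The permutation idea can be sketched in a few lines: shuffle one feature's values, re-score the model, and report the increase in error. The toy `predict` function below is a hypothetical stand-in for a fitted forest, used only to show the mechanics.

```python
import random
import statistics

def mse(predict, X, y):
    """Mean squared error of a prediction function on data (X, y)."""
    return statistics.mean((yi - predict(xi)) ** 2 for xi, yi in zip(X, y))

def permutation_importance(predict, X, y, j, rng):
    """Mean Decrease Accuracy (permutation importance) sketch:
    permute feature j and report the resulting increase in error."""
    base = mse(predict, X, y)
    Xp = [row[:] for row in X]            # copy so X is untouched
    col = [row[j] for row in Xp]
    rng.shuffle(col)                      # break the feature-response link
    for row, v in zip(Xp, col):
        row[j] = v
    return mse(predict, Xp, y) - base

# Toy model: y depends on feature 0 only; feature 1 is pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [2 * x0 for x0, _ in X]
predict = lambda row: 2 * row[0]          # stand-in for a fitted forest

imp_informative = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
```

Permuting the informative feature inflates the error substantially, while permuting the noise feature leaves it unchanged, which is exactly the contrast the MDA measure exploits.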
2.7 Problems in High Dimensions
As mentioned earlier, the problem of high dimension arises when the dataset contains a huge number of features of which only a small percentage are truly informative. This degrades the accuracy of the base classifiers (trees): since simple random sampling is employed at each node to select the subset of m candidate variables, non-informative features will dominate the random subset of candidate splitting variables. For instance, consider data D = \{(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}\}_{i=1}^{n} with Θ features in total, of which only ω are informative, and suppose we want to compute the expected number of informative features among the default mtry candidates used in building a forest. Let η be the number of features selected at each node of the tree by random sampling with equal weights (probabilities). Then the number of informative features selected, x, follows a binomial distribution with η trials, where the probability of selecting an informative feature is p = ω/Θ and the probability of selecting a non-informative feature is q = 1 − ω/Θ:

    x \sim \text{Binomial}(\eta, p), \qquad \Pr(x; \eta, p) = \binom{\eta}{x} p^x q^{\eta - x}

At each node, the expected number of informative features selected is μ = ηp. Since the proportion of informative features is small (p < q), the expectation μ will also be small.
For example, given Θ = 10,000 and ω = 100,

    p = \frac{\omega}{\Theta} = \frac{100}{10{,}000} = 0.01.

For regression, the default is η₁ = Θ/3 = 10,000/3 ≈ 3,333; similarly, for classification, η₂ = √Θ = √10,000 = 100, where η₁ and η₂ are the default mtry values associated with regression and classification trees respectively. In the case of classification, E(x) = η₂p = 100(0.01) = 1, and for regression, E(x) = η₁p = 3333(0.01) ≈ 33. This implies that the expected number of informative features per node among the default mtry candidate variables is 1 for classification trees and 33 for regression trees. Therefore, the resultant base classifiers will have low accuracies, which in turn degrades the overall performance of the forest.
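The expected counts above can be reproduced with a short sketch; the default mtry rules (√Θ for classification, Θ/3 for regression) follow the text, and the `expected_informative` and `binom_pmf` helpers are ours.

```python
import math

def expected_informative(total, informative, mtry):
    """Expected number of informative features among mtry candidates
    drawn uniformly at random: E[x] = mtry * (informative / total)."""
    return mtry * informative / total

def binom_pmf(x, n, p):
    """Binomial probability mass function Pr(x; n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

theta, omega = 10_000, 100              # total and informative features
p = omega / theta                       # 0.01

mtry_class = round(math.sqrt(theta))    # default for classification: sqrt(Theta)
mtry_reg = theta // 3                   # default for regression: Theta / 3

e_class = expected_informative(theta, omega, mtry_class)  # = 1
e_reg = expected_informative(theta, omega, mtry_reg)      # ~ 33.3

# Chance that a classification node sees *no* informative candidate at all:
p_none = binom_pmf(0, mtry_class, p)
```

Notably, with these defaults a classification node has roughly a one-in-three chance (0.99^100 ≈ 0.37) of drawing no informative feature whatsoever, which is why the base trees degrade.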
To better understand this effect, we performed a simulation study to illustrate the performance of the random forest model on high-dimensional data by analyzing its prediction accuracy (mean squared error). A training set D₁ of size n = 500 and a test set D₂ of size n = 5000 were generated from a linear model, a nonlinear model (the Multivariate Adaptive Regression Splines model, MARS) and a tree model.
The first 5 predictors (x₁, x₂, ..., x₅) were considered the true variables and \{x_i\}_{i=6}^{505} were noise variables with \{x_i\}_{i=6}^{505} \sim N(0, 1). For each of the simulation models, two random forest models were fitted: the oracle model and the naive model. The oracle model was fitted with only the true predictors (p = 5) and the naive model was fitted with all the predictors (p = 505), with 50 iterations. For each RF model (oracle and naive), we considered both the case of no correlation between covariates (ρ = 0) and the case of correlated covariates (ρ = 0.5).
2.7.1 Simulation Models
For the purpose of performing a simulation study on the problem, data were simulated from the three models below:

    y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon_i    (2.20)

    y_i = 10 \sin(\pi x_1 x_2) + 20 \left( x_3 - \tfrac{1}{2} \right)^2 + 10 x_4 + 5 x_5 + \varepsilon_i    (2.21)

    y_i = 5\, I(X_1 \le 1)\, I(X_2 \le 0.2) + 4\, I(X_3 \le 1)\, I(X_4 \le -0.2) + 3\, I(X_5 \le 0) + \varepsilon_i    (2.22)

where \varepsilon_i \sim N(0, 1) and I(\cdot) is an indicator function.
Model (2.20) posits a linear relationship between the true predictor variables x₁, x₂, ..., x₅ and the response, and is used to examine the performance of the RF model on high-dimensional data with an underlying linear structure. The true coefficients β₀, ..., β₅ of model (2.20) were 0.5, 0.1, −0.4, 0.5, 1, 3 respectively. Covariates were generated from the standard normal distribution, \{x_i\}_{i=6}^{505} \sim N(0, 1).
In model (2.21), the response variable Y is simulated using the Multivariate Adaptive Regression Splines (MARS) model of Friedman (1991), featuring nonlinear associations and variable interactions among predictors, to examine the performance of RF on high-dimensional data when the target variable has both linear and nonlinear additive dependence. Finally, the response of model (2.22) was simulated using a tree function, to examine the performance of the random forest model on high-dimensional data when the response is generated by a tree structure.
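A minimal data-generating sketch of models (2.20)–(2.22), assuming i.i.d. N(0, 1) covariates and errors as in the text; the `simulate` helper and its argument names are ours, not the thesis code.

```python
import math
import random

def simulate(model, n, p_noise=500, rng=None):
    """Generate (X, y) from the three simulation designs:
    'linear' (2.20), 'mars' (2.21), or 'tree' (2.22). The first five
    covariates are informative; the remaining p_noise are N(0, 1) noise.
    Linear-model coefficients follow the text:
    beta_0..beta_5 = 0.5, 0.1, -0.4, 0.5, 1, 3."""
    rng = rng or random.Random(0)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(5 + p_noise)]
        eps = rng.gauss(0, 1)
        if model == "linear":
            b = [0.5, 0.1, -0.4, 0.5, 1, 3]
            yi = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x[:5])) + eps
        elif model == "mars":
            yi = (10 * math.sin(math.pi * x[0] * x[1])
                  + 20 * (x[2] - 0.5) ** 2 + 10 * x[3] + 5 * x[4] + eps)
        else:  # tree model (2.22); booleans act as 0/1 indicators
            yi = (5 * (x[0] <= 1) * (x[1] <= 0.2)
                  + 4 * (x[2] <= 1) * (x[3] <= -0.2)
                  + 3 * (x[4] <= 0) + eps)
        X.append(x)
        y.append(yi)
    return X, y
```

With the default `p_noise=500` this reproduces the p = 505 design of the study, where only the first five columns carry signal.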
Graphical Display of Simulation Results
Figure 2.1: Prediction MSE for linear model when ρ = 0
Figure 2.2: Prediction MSE for linear model when ρ = 0.5
Figure 2.3: Prediction MSE for MARS model when ρ = 0
Figure 2.4: Prediction MSE for MARS model when ρ = 0.5
Figure 2.5: Prediction MSE for tree model when ρ = 0
Figure 2.6: Prediction MSE for tree model when ρ = 0.5
For each generated dataset, two types of random forests were built. For the first type, which we refer to as the 'oracle model', only the truly important variables were used. For the second type, which we call the 'naive model', all variables were used blindly. The MSE values were monitored during the training process of the random forest as the number of trees accumulated. A test sample of size n = 5,000 was also sent down the fitted RF models along the training process to obtain the prediction MSE measures. The resultant MSE and prediction MSE measures for the different models were plotted versus the number of trees in the forest in Figures 2.1–2.6. The results allow us to inspect how both RF methods perform and compare to each other as the number of trees increases.
In Figures 2.1–2.6, mse0 and mse represent the MSE values for the oracle and naive models respectively, whereas amse0 and amse represent the average MSE values over simulation runs. From the graphs, we can observe that the oracle models containing only the true predictors (p = 5) have smaller prediction mean squared errors (light blue for runs and deep blue for the average of runs by trees) compared to the naive models containing the true predictors plus the noise variables (p = 505; light red for runs and deep red for the average of runs by trees). Moreover, although the same true predictors were used in both models, in the presence of noise variables the performance of RF declined substantially, regardless of whether there is correlation between covariates.
2.8 Related Works
In recent times, many studies on the problems associated with high dimensions have been conducted. As mentioned earlier, having a small portion of important predictors mixed with massive numbers of noise variables is one factor that contributes to the problems of high dimensions. Using simple random sampling to select the subset of features at each node will likely include a large number of non-informative features and degrade the performance of the base learners (trees). A method to remedy this problem was proposed by Amaratunga et al. (2008). Instead of simple random sampling, this method introduced weighted random sampling at each node of each tree. In their approach, weights \{w_i\} are imposed on the genes (variables) so that less informative genes are less likely to be selected, which increases the odds that trees containing more informative features are included in the forest. Features were scored by testing each feature for a group mean effect using a two-sample t-test, with q-values (Storey and Tibshirani, 2003) calculated from the p-values as

    q_{(i)} = \min_{k \ge i} \left\{ \min\!\left( \frac{\Theta}{k}\, p_{(k)},\; 1 \right) \right\}

where Θ is the total number of features, p_{(k)} is the k-th smallest p-value, and q_{(i)} is the q-value of the feature with the i-th smallest p-value.
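As a sketch of the q-value computation (assuming the running-minimum, step-up reading of the formula above), one can walk the sorted p-values from largest to smallest:

```python
def q_values(pvals):
    """Sketch of q-values in the style of Storey and Tibshirani (2003):
    with Theta features and sorted p-values p_(1) <= ... <= p_(Theta),
    q_(i) = min over k >= i of min((Theta / k) * p_(k), 1)."""
    theta = len(pvals)
    order = sorted(range(theta), key=lambda i: pvals[i])
    q = [0.0] * theta
    running = 1.0
    # Walk from the largest p-value down, keeping a running minimum so
    # that each q-value is monotone in the p-value ranking.
    for rank in range(theta, 0, -1):
        i = order[rank - 1]
        running = min(running, min(theta / rank * pvals[i], 1.0))
        q[i] = running
    return q

qs = q_values([0.001, 0.02, 0.4, 0.9])
```

For the four p-values above this yields q-values 0.004, 0.04, 0.533..., 0.9, i.e., each p-value is inflated by Θ/k and then forced to be monotone and capped at 1.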
The incorporation of co-data to replace the uniform sampling probabilities has also been used in some recent work on high-dimensional random forests. Te Beest et al. (2017) introduced the Co-data moderated Random Forest (CoRF), in which co-data are incorporated into the random forest by replacing the uniform sampling probabilities used to draw candidate variables with co-data moderated sampling probabilities (obtained via empirical Bayes estimation).
Random forests have also shown low performance on very high-dimensional data with dependencies: the general random forest method does not efficiently exploit cases where many combinations among variables arise from such dependencies. Do et al. (2010) extended the initial idea of Do et al. (2009) by proposing a random oblique decision tree method that uses linear proximal SVMs to perform multivariate node splitting at the tree-growing stage, so that the trees can make use of dependencies among features. This method succeeded in producing stronger individual trees than the usual forest method, which picks only a single feature for splitting at each node. Their proposed method improves the performance of random forests significantly on very high-dimensional data and slightly on lower-dimensional data.
High-dimensional datasets often contain highly correlated features, which hinder the random forest's ability to identify the strongest features by reducing the estimated variable importance scores of correlated features (Gregorutti et al., 2017). Gregorutti et al. (2017) proposed a solution known as the Random-Forest-Recursive Feature Elimination (RF-RFE) algorithm as a variable selection criterion (Darst et al., 2018); permutation variable importance scores are employed as a ranking criterion to recursively eliminate variables in a backward elimination strategy. Wright and Ziegler (2015) developed a fast implementation package for high-dimensional data in C++ and R known as RANdom forest GEneRator (ranger), which focuses on implementation efficiency and memory usage when dealing with high dimensions. A recent R implementation for high-dimensional random forests is the Rborist package (Seligman, 2015), and a novel software package known as Random Jungle (RJ) is available in C++ (Schwarz et al., 2010). RJ implements all the features of the original randomForest implementation by Breiman and Cutler (2004), which was written in Fortran 77.
Chapter 3
High-Dimensional Random Forests
with Ridge Regression Based
Variable Screening
In this chapter, we present the proposed method for developing random forest models with high-dimensional data, or data that contain a large number of noise variables. In this project, we develop a novel feature screening method, based on ridge regression, to apply before fitting a random forest model. High-dimensional data will often present many candidate variables for splitting-variable selection, most of which are non-informative or have no predictive power. Therefore, we propose using the ridge estimators as a variable screener and applying the original random forest to a top portion of the screened features.
3.1 Ridge Estimator
The multicollinearity problems associated with the ordinary least squares estimate motivated the idea of ridge regression. Ridge regression introduces an ℓ2 penalty on the regression coefficients to penalize the least squares loss.
Consider the general linear regression model

    Y_i = \beta_0 + \sum_{j=1}^{p} X_{ij} \beta_j + \varepsilon_i    (3.1)

where X₁, ..., X_p are p predictor variables and the errors \varepsilon_i are random with mean 0 and variance 1, \varepsilon_i \sim N(0, 1). The goal of this linear model is to estimate the unknown regression parameters \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T.
Recall that the ordinary least squares fitting procedure estimates the regression parameters \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T by minimizing the residual sum of squares

    RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2    (3.2)

The least squares estimates can explode with absurdly high variance when the matrix X^T X is singular or nearly singular. To control these variations, a ridge constraint (penalty) is imposed to regularize the coefficients and control how large they grow.
The ridge regression coefficient estimates \hat{\beta}^R = (\hat{\beta}^R_0, \hat{\beta}^R_1, \ldots, \hat{\beta}^R_p)^T are the values that minimize the penalized residual sum of squares

    \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2    (3.3)

where λ ≥ 0 is the tuning parameter.
The above penalized LS form can be equivalently written as the constrained optimization problem

    \min \sum_{i=1}^{n} \left( y_i - z_i^T \beta \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t
    \quad \Longleftrightarrow \quad \min\, (y - Z\beta)^T (y - Z\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t    (3.4)

where Z is the matrix of standardized variables with mean 0 and variance 1.
Consider the penalized residual sum of squares

    PRSS(\beta) = \sum_{i=1}^{n} \left( y_i - z_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = (y - Z\beta)^T (y - Z\beta) + \lambda \lVert \beta \rVert_2^2    (3.5)

Taking the derivative and setting it to zero, we obtain

    \frac{\partial PRSS(\beta)}{\partial \beta} = -2 Z^T (y - Z\beta) + 2\lambda\beta \overset{\text{set}}{=} 0
    \;\Rightarrow\; Z^T (y - Z\beta) - \lambda\beta = 0
    \;\Rightarrow\; Z^T y - (Z^T Z + \lambda I_p)\beta = 0

where I_p is the p × p identity matrix. Hence the ridge estimator of β is of the form

    \hat{\beta}^R_\lambda = (Z^T Z + \lambda I_p)^{-1} Z^T y    (3.6)
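Equation (3.6) can be computed directly in a few lines of NumPy; the data below are synthetic and the `ridge` helper is ours, not the thesis implementation.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator (3.6):
    beta = (Z'Z + lambda * I)^{-1} Z'y, with Z standardized."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

rng = np.random.default_rng(0)
n, p = 100, 5
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)      # standardize the columns
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.0])
y = Z @ beta_true + rng.standard_normal(n)

b_small = ridge(Z, y, lam=0.01)   # close to the least squares solution
b_big = ridge(Z, y, lam=1e6)      # heavily shrunk toward zero
```

Solving the linear system with `np.linalg.solve` is preferred over forming the explicit inverse, and the two penalty settings illustrate the shrinkage behavior discussed next.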
3.2 Ridge Penalty (λ)
The solution (3.6) of the ridge estimator is indexed by the ridge penalty, also known as the tuning parameter λ; each λ value produces a unique solution. This penalty shrinks the coefficient estimates toward 0 as the penalty parameter increases.
λ is the shrinkage parameter and it does the following:
• the size of the ridge coefficients is controlled by λ
• λ also controls the amount of regularization
• as λ → 0, the least squares solution is obtained: \lim_{\lambda \to 0} \hat{\beta}^R_\lambda = (X^T X)^{-1} X^T Y
• as λ → ∞, \hat{\beta}^R_\lambda \to 0 (the model becomes an intercept-only model)
As listed above, the tuning parameter λ plays an important role in the solution of PRSS(β). In our proposed method, however, we fix the tuning parameter at λ = ε, where ε is any small number close to 0. A small λ value lowers the effect of multicollinearity while producing an estimate that is nearly as good as the least squares estimate (LSE). This works because the ranking of variables is invariant to the choice of λ, in a probabilistic sense, as long as λ lies within some reasonable range (more justification is provided in the next section). Fixing λ in advance also improves the computational efficiency of the algorithm.
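The ranking-invariance claim can be illustrated numerically: for two penalties within a reasonable range, the ordering of the leading |β̂^R_j| values typically agrees. This is a synthetic sketch, not a proof, and the helper is ours.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator: (Z'Z + lambda * I)^{-1} Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

rng = np.random.default_rng(42)
n, p = 200, 8
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
beta = np.array([4.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y = Z @ beta + rng.standard_normal(n)

# Rank the features by |beta^R| under two different small penalties;
# the ordering of the informative features should agree.
rank_a = np.argsort(-np.abs(ridge(Z, y, 0.001)))
rank_b = np.argsort(-np.abs(ridge(Z, y, 1.0)))
```

Because λ is small relative to the eigenvalues of Z'Z here, the two rankings place the same four informative features at the top in the same order, which is what licenses fixing λ = ε for screening.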
3.3 Variable Screening
Our interest is in using ridge regression to screen the predictors in order to remove some of the noise predictors that cause the low performance of random forest models in high dimensions. The variable screening procedure begins with an initial stage that ranks the predictors in order of importance. At this initial stage, a ridge regression model is fitted on the entire data. The ridge estimates (coefficients) are sorted by their absolute values, such that |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j', and a top portion (say α₁%) of the predictors is selected based on the sorted coefficients. In other words, the bottom (100 − α₁)% of the variables is truncated. After this initial stage of variable screening, random forest can be applied to the selected relevant variables. However, the algorithm is flexible in allowing multiple stages of variable screening, depending on the dimension of the data and on whether the selected portion is still dominated by noise variables.
In the second stage of the algorithm, another ridge model is fitted on the selected α₁% of the variables and the ridge estimates (coefficients) are again sorted such that |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j'. At this stage, α₂% of the predictors are selected based on the sorted coefficients; equivalently, the bottom (100 − α₂)% of the predictors is truncated. The remaining αγ% of the variables will then contain more relevant and fewer uninformative variables, yielding a better random forest model. The flexibility of the algorithm allows for multiple stages (M) of variable screening, and the parameters {α₁, α₂, ..., α_M} can be adjusted by the user.
Conjecture 1. Assuming the predictors x_j are all normalized, if |\beta_j| \ge |\beta_{j'}|, then with high probability |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for any λ > 0 that may satisfy some conditions.
Input: Data D = {(x_i, y_i) ∈ IR^p × IR}_{i=1}^{n}
Data: Training data D₁ = {(x_i, y_i) ∈ D} for ridge regression
1  begin
2      set the tuning parameter λ = ε
3      set the screening percentages α₁, ..., α_M
4      for m → 1 to M do
5          fit a ridge regression model on the current training set
6          extract the coefficients β^R_j
7          sort the coefficients by absolute value, |β^R_(1)| ≥ |β^R_(2)| ≥ ...
8          keep the top α_m% of the features and discard the rest
9          set the current training set to the retained features
10     end
11     set the final reduced data to D₁^red
12 end
Data: New training data D₁^red = {(x_i, y_i) ∈ D₁}
13 set the number of trees, ntree = T
14 fit a random forest model RF with T trees using D₁^red
Result: Random forest model RF with the αγ% important ridge regression features
Algorithm 1: Algorithm for the proposed method.
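Algorithm 1's screening loop can be sketched compactly in Python using the closed-form ridge estimator; the final random forest fit is omitted and the helper names (`ridge`, `ridge_screen`) are ours, not the thesis code.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator: (Z'Z + lambda * I)^{-1} Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

def ridge_screen(X, y, alphas, lam=1e-3):
    """Multi-stage ridge screening (Algorithm 1 sketch): at each stage,
    fit ridge on the surviving columns and keep the top alpha fraction
    by |coefficient|. Returns the indices of the retained columns."""
    keep = np.arange(X.shape[1])
    for alpha in alphas:                      # e.g. alphas = [0.30, 0.50]
        Z = X[:, keep]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
        coef = np.abs(ridge(Z, y, lam))
        k = max(1, int(round(alpha * len(keep))))
        keep = keep[np.argsort(-coef)[:k]]    # top-k by |beta^R|
    return np.sort(keep)

# Toy check: 5 informative features among 100.
rng = np.random.default_rng(7)
n, p = 300, 100
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([3.0, -2.5, 2.0, 1.5, 1.0]) + rng.standard_normal(n)

kept = ridge_screen(X, y, alphas=[0.30, 0.50])   # retains 15% of columns
```

With α₁ = 30% and α₂ = 50%, the two stages retain 30 and then 15 of the 100 columns, and the five informative features survive both cuts; a random forest would then be fitted on the retained columns only.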
Chapter 4
Simulation Study
In this chapter, we present a series of simulation studies to explore and investigate the
performance of the proposed methods. We illustrate the capability of ridge regression in
screening variables and apply random forest to a top α% of relevant predictors. The aim
is to demonstrate the effectiveness of the proposed method in handling high-dimensional
data. Datasets of high dimensions with different sizes and characteristics were generated
and used in these experiments. The results were compared to the standard random forest method via prediction mean squared errors (PMSE).
4.1 Simulation Models
High-dimensional data, with a training set D₁ of size n = 500 and a test set D₂ of size n = 5000, were generated from the linear model (2.20) and the tree model (2.22) presented in Section 2.7.1.
The first 5 predictors (x₁, x₂, ..., x₅) were considered the true variables. For each simulation model, 500 additional noise predictors \{x_i\}_{i=6}^{505} were randomly generated from the standard normal distribution N(0, 1). These predictors are considered noise because they have little or no significant contribution in predicting the response variable Y. For each of the simulation models, three random forest models were fitted: the oracle model, the naive model and the proposed model. The oracle model was fitted with only the true predictors (p = 5), the naive model was fitted with all the predictors (p = 505), and the proposed model was fitted with the screened variables. For each RF model (oracle, naive and proposed), we considered both the case of no correlation between covariates and the case of correlated covariates, that is, ρ ∈ {0, 0.5}. A total of 100 simulation iterations (runs) were performed with both the training and test sets, growing 500 trees for each run.
Two variable screening stages of the algorithm were performed for the proposed model. At the first stage, 30% of the predictors were selected based on the sorted ridge regression coefficients |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j' (i.e., α₁ = 30%). In the second stage, 50% of the initially selected 30% of the predictors were retained, again based on the sorted coefficients (i.e., α₂ = 50%). The screening therefore retains 15% of the variables of D₁ and D₂, a subset containing a much higher proportion of informative variables: (50/100) × (30/100) = 15/100 = 15%. After the second stage of variable screening, the randomForest package in R was applied to the retained 15% of relevant variables; see Liaw et al. (2002) and RColorBrewer and Liaw (2018) for random forest implementation in R.
4.2 Comparing Performance of the Proposed Model to the Existing Model
The performance of the proposed method is compared to that of the classical random forest method. In this regard, we employed the prediction mean squared errors (MSE) for the runs and the average prediction mean squared errors (AMSE) as performance measures. The MSE is computed by equation (2.15) and the AMSE by equation (2.18), both presented in Section 2.5. For each of the models (oracle, naive and proposed), the mean squared errors were computed across the number of trees (T = 500) and runs (R = 100).
Tables 4.1, 4.2, 4.3 and 4.4 present the comprehensive results on the average mean squared error (2.18) by runs R. Figures 4.1, 4.2, 4.3 and 4.4 below give a visual presentation of the mean squared errors by runs and trees (2.17) and the average mean squared errors by runs (2.18) for the linear model, and Figures 4.5, 4.6, 4.7 and 4.8 present the same for the tree model. The red, green and blue points represent the MSEs and AMSEs of the naive model, oracle model and proposed model respectively.
It can be observed from the tables and figures below that the MSE and AMSE values of the proposed model are smaller than those of the oracle and naive models. Also, the MSE stabilizes as the number of trees increases (ntree → ∞), supporting the general observation that the performance of a forest improves with a large number of trees. In terms of precision and performance, the proposed method clearly outperforms the existing random forest method on high-dimensional data. Comparing the cases ρ ∈ {0, 0.5}, the proposed method performs better regardless of the presence of multicollinearity. Moreover, judging by the mean squared errors (2.15), applying ridge regression for variable screening clearly yields a set of highly relevant/informative variables, and thus a better random forest model in the high-dimensional setting.
4.2.1 Simulation Results from Linear Model
Table 4.1: Average prediction MSE values for linear model (2.20) with ρ = 0, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     9.04    8.62         9.51    9.14        8.82    8.60
2      1     7.91    5.26         8.74    5.79        7.91    5.39
3      1     7.23    4.31         7.91    4.64        6.95    4.17
4      1     6.75    3.85         7.45    4.18        6.35    3.69
5      1     6.22    3.56         6.94    3.84        5.94    3.42
...
249    1     2.74    2.41         2.95    2.56        2.44    2.24
250    1     2.74    2.41         2.95    2.56        2.44    2.24
251    1     2.74    2.41         2.95    2.56        2.44    2.24
252    1     2.74    2.41         2.95    2.56        2.44    2.24
...
497    1     2.71    2.41         2.92    2.55        2.41    2.22
498    1     2.71    2.41         2.92    2.55        2.41    2.22
499    1     2.71    2.41         2.92    2.55        2.41    2.22
500    1     2.71    2.41         2.92    2.55        2.41    2.22
Figure 4.1: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.
Figure 4.2: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.
Table 4.2: Average prediction MSE values for linear model (2.20) with ρ = 0.5, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     6.06    5.97         7.58    7.52        6.73    6.77
2      1     5.71    4.08         6.89    4.86        5.84    4.26
3      1     5.29    3.43         6.39    3.94        5.46    3.44
4      1     4.87    3.10         5.88    3.49        5.02    3.04
5      1     4.54    2.91         5.35    3.18        4.61    2.81
...
249    1     2.13    2.15         2.14    2.17        1.84    1.89
250    1     2.13    2.15         2.14    2.17        1.84    1.89
251    1     2.13    2.15         2.14    2.17        1.84    1.89
252    1     2.13    2.15         2.14    2.17        1.84    1.89
...
497    1     2.11    2.14         2.12    2.16        1.82    1.88
498    1     2.11    2.14         2.12    2.16        1.82    1.88
499    1     2.11    2.14         2.12    2.16        1.82    1.88
500    1     2.11    2.14         2.12    2.16        1.82    1.88
Figure 4.3: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.5.
Figure 4.4: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.5.
4.2.2 Simulation Results from Tree Model
Table 4.3: Average prediction MSE values for tree model (2.22) with ρ = 0, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     7.65    7.83         10.10   10.00       8.91    8.93
2      1     7.16    4.96         9.61    6.43        7.93    5.45
3      1     6.58    4.02         9.06    5.28        7.62    4.50
4      1     6.07    3.55         8.34    4.59        7.02    3.95
5      1     5.54    3.23         7.59    4.13        6.45    3.57
...
249    1     2.33    2.11         3.05    2.57        2.50    2.21
250    1     2.33    2.11         3.04    2.56        2.50    2.21
251    1     2.33    2.11         3.05    2.56        2.50    2.21
252    1     2.33    2.11         3.05    2.57        2.50    2.21
...
497    1     2.30    2.09         3.01    2.55        2.46    2.20
498    1     2.30    2.09         3.01    2.55        2.46    2.20
499    1     2.30    2.09         3.00    2.55        2.46    2.20
500    1     2.30    2.09         3.00    2.55        2.46    2.20
Figure 4.5: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.
Figure 4.6: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.
Table 4.4: Average prediction MSE values for tree model (2.22) with ρ = 0.5, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     6.97    6.88         8.48    8.39        7.28    7.26
2      1     6.31    4.46         7.58    5.20        6.60    4.65
3      1     5.72    3.57         7.01    4.15        6.02    3.70
4      1     5.28    3.19         6.55    3.69        5.56    3.27
5      1     4.92    2.96         6.13    3.40        5.14    2.99
...
249    1     2.04    2.00         2.26    2.13        1.99    1.94
250    1     2.04    2.00         2.26    2.13        1.99    1.94
251    1     2.04    2.00         2.26    2.13        1.99    1.94
...
497    1     2.01    1.99         2.23    2.11        1.96    1.93
498    1     2.01    1.99         2.23    2.11        1.95    1.93
499    1     2.01    1.99         2.23    2.11        1.95    1.93
500    1     2.01    1.99         2.23    2.11        1.95    1.93
Figure 4.7: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.5.
Figure 4.8: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.5.
Chapter 5
Real Data Exploration
This chapter presents the analysis of a well-known dataset sourced from the UCI Machine Learning Repository to illustrate the proposed method's handling of high-dimensional data. Although the original data are not of high dimension, a large number of noise (redundant) variables simulated from the standard normal distribution were added to the dataset. The results of the experiment indicate that the proposed method significantly outperforms Breiman's (2001) original random forest algorithm.
5.1 Communities and Crime Dataset
The Communities and Crime dataset is a multivariate dataset that combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. It covers communities within the United States. The dataset contains many variables related to the target variable; the attributes were picked based on their plausible connections to crime. It consists of 122 predictors plus the attribute to be predicted (per capita violent crimes). The features in the crime dataset describe the community, such as the percentage of the population considered urban and the median family income, as well as law enforcement, such as the per capita number of police officers and the percentage of officers assigned to drug units. The target variable was calculated using the population of the communities and the offenses considered violent crimes in the United States, such as murder, rape, assault and robbery. The data were normalized into the decimal range 0.00–1.00 using an unsupervised equal-interval binning method. Additional variables \{X_i\}_{i=1}^{700} were randomly generated from the standard normal distribution, \{X_i\}_{i=1}^{700} \sim N(0, 1). These variables are unrelated and uninformative with respect to the target variable, and hence are considered noise variables. The final data consisted of 822 variables, including 1 response variable, with 1994 observations (cases). The dataset was thoroughly cleaned for analysis purposes and partitioned into a training set and a test set in the ratio 2:1. Variables with less than 50% missing values were imputed using the Multivariate Imputations by Chained Equations (MICE) R package, whereas variables with more than 50% missing values were removed.
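The 50% missingness rule can be sketched as follows; plain mean imputation stands in here for the MICE procedure used in the actual analysis, and the `clean_columns` helper is illustrative only.

```python
import statistics

def clean_columns(columns, max_missing=0.5):
    """Sketch of the preprocessing rule: drop any variable with more
    than `max_missing` missing values (None); otherwise impute the
    gaps. Mean imputation stands in for the MICE procedure."""
    cleaned = {}
    for name, values in columns.items():
        n_missing = sum(v is None for v in values)
        if n_missing / len(values) > max_missing:
            continue                          # too sparse: remove variable
        observed = [v for v in values if v is not None]
        fill = statistics.mean(observed)      # MICE would model this instead
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

cols = {
    "x1": [1.0, None, 3.0, 4.0],              # 25% missing -> imputed
    "x2": [None, None, None, 8.0],            # 75% missing -> dropped
}
out = clean_columns(cols)
```

Here `x2` is discarded while the single gap in `x1` is filled with the mean of its observed values.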
The analysis was carried out with our proposed method, and its performance was compared to that of the standard random forest method. Performance measures such as the prediction mean squared error (MSE, equation 2.15) and the percentage of variation explained (equation 2.16) were used. The ridge estimator was applied in variable screening, with α₁ = 30% and α₂ = 50%, to obtain the informative predictors. In each stage of variable screening, predictors were selected based on the sorted ridge estimates, |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j'. The selected 15% of the total variables hold most of the predictive power, and the randomForest package in R was then applied to these 15% of relevant variables. The performance of the proposed and existing methods was assessed via the mean squared errors and the percentage of variation explained by the models. In summary, the naive random forest method was fitted with all the predictors (p = 821), whereas the proposed random forest method was fitted with the 15% of relevant variables, that is, p = 123 variables.
5.1.1 Presentation of Results from Real Data Application
Table 5.1: Results of analysis of the Communities and Crime dataset, comparing the proposed method to the existing (naive) random forest method in dealing with high dimensions.

Models            Dataset       No. of Trees   MSE       % Var Explained
Naive Method      Training Set  500            0.02052   61.85
                  Test Set      500            0.0500    12.24
Proposed Method   Training Set  500            0.01952   63.72
                  Test Set      500            0.0200    65.24
Table 5.2: Prediction MSE values of the training and test sets for the proposed method and the existing (naive) random forest method.

              Naive Method             Proposed Method
Trees   Training MSE   Test MSE   Training MSE   Test MSE
1       0.0448         0.0663     0.0369         0.0428
2       0.0420         0.0525     0.0353         0.0304
3       0.0438         0.0484     0.0340         0.0275
4       0.0424         0.0541     0.0333         0.0254
5       0.0397         0.0550     0.0329         0.0225
...
249     0.0207         0.0476     0.0197         0.0193
250     0.0207         0.0475     0.0197         0.0193
251     0.0207         0.0474     0.0197         0.0193
252     0.0207         0.0474     0.0197         0.0193
...
497     0.0205         0.0485     0.0195         0.0192
498     0.0205         0.0484     0.0195         0.0192
499     0.0205         0.0484     0.0195         0.0192
500     0.0205         0.0484     0.0195         0.0192
Figure 5.1: Out-of-bag error estimates of both models
Figure 5.2: Comparing Prediction MSE for Training and Test Sets
Tables 5.1 and 5.2 present the performance measures (mean squared errors and per-
centage of variation explained) of the existing RF model (mse1) and the proposed method
(mse2). These results are visualized in Figures 5.1 and 5.2. It is clear from the results
(Figures 5.1 and 5.2) that the out-of-bag error estimates and MSEs of the proposed
method (mse2) are lower than those of the existing method (naive, mse1) of Breiman
(2001). The percentage of variation explained after variable screening via the ridge
estimator also increases significantly compared to the existing method.
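As a side note, the "% Var Explained" that randomForest reports is a pseudo-R2 computed as 1 - MSE/Var(y), so the MSE and variance-explained columns of Table 5.1 are two views of the same quantity. A minimal base-R sketch (the numbers below are hypothetical illustrations, not taken from the analysis):

```r
# Pseudo R-squared as reported by randomForest: 1 - MSE / Var(y)
# (var_y and mse below are hypothetical illustrative values)
var_y <- 0.0538                 # variance of the response
mse   <- 0.0205                 # out-of-bag mean squared error
pct_var_explained <- 100 * (1 - mse / var_y)
round(pct_var_explained, 2)     # about 61.9
```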
Chapter 6
Discussion and Conclusion
6.1 Summary of Findings
In this study, we have presented a variable screening method for informative feature se-
lection when building random forest models with high-dimensional data. This new random
forest algorithm integrates ridge regression as a variable screening tool to obtain informative
predictors in high-dimensional settings. The algorithm allows for multiple stages (M) of
variable screening by specifying the percentage (αM) of variables to be retained at each
stage. It can be extended to multi-class classification problems and to data containing
noisy variables, retaining a higher proportion of informative variables for improved
RF modeling.
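The multi-stage screening step described above can be sketched in base R. This is an illustrative sketch, not the thesis implementation: it uses the closed-form ridge solution on standardized simulated predictors, a fixed lambda, and a constant keep-fraction alpha at every one of the M stages; the helper names ridge_coef and screen are introduced here purely for illustration.

```r
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # two informative predictors

ridge_coef <- function(X, y, lambda) {
  # closed-form ridge coefficients on standardized predictors
  Xs <- scale(X)
  drop(solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, y)))
}

screen <- function(X, y, M = 2, alpha = 0.5, lambda = 0.01) {
  keep <- colnames(X)
  for (m in 1:M) {
    b <- ridge_coef(X[, keep, drop = FALSE], y, lambda)
    n.keep <- max(1, trunc(length(keep) * alpha))
    # rank surviving variables by absolute ridge coefficient
    keep <- keep[order(abs(b), decreasing = TRUE)][1:n.keep]
  }
  keep
}

selected <- screen(X, y, M = 2, alpha = 0.5)
length(selected)   # 50 -> 25 -> 12 variables after two stages
```

In the thesis implementation the ridge fits are obtained with MASS::lm.ridge (see the appendix code) rather than this closed-form helper.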
A preliminary simulation study was carried out to illustrate and assess the problems associ-
ated with high-dimensional random forests. The study demonstrated that the presence
of massive numbers of noisy variables significantly degrades the performance and precision
of random forests. This result supports the work of Xu et al. (2012), Amaratunga et al.
(2008), and Darst et al. (2018) on how data dominated by a large number of uninformative
variables and correlated predictors can significantly harm model performance. The
empirical results revealed how the small expected number of informative variables among
the mtry candidates can make it difficult to obtain an accurate or robust random forest
model, which supports the findings of Amaratunga et al. (2008).
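To make the mtry point concrete: when mtry candidates are drawn uniformly from p variables of which p.info are informative, the expected number of informative candidates at any split is the hypergeometric mean mtry × p.info / p. With counts matching the simulation design (5 informative variables plus 500 noise variables):

```r
# Expected informative candidates per split (hypergeometric mean)
p <- 505                 # 5 informative + 500 noise predictors
p.info <- 5
mtry <- floor(p / 3)     # default regression mtry in randomForest
mtry * p.info / p        # about 1.66 informative candidates per split
```

With fewer than two informative variables expected among 168 candidates, most splits are chosen from noise, which is exactly the difficulty the screening step is meant to relieve.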
An extension of the simulation studies was then employed to test how our proposed algo-
rithm addresses this issue by first screening variables to retain the relevant ones, without
compromising, but rather improving, the performance of the random forest models. To better
understand how our method works, we also included the naive and oracle methods, which
allowed us to make meaningful comparisons via performance measures (e.g., MSE).
Based on these performance measures (mean squared errors), the proposed
method clearly outperforms the naive random forest approach.
In an attempt to further validate the results, we applied our proposed method to a real-
life dataset (the Communities and Crime dataset), sourced from the UCI repository.
From the results presented in Tables 5.1 and 5.2, there was a significant improvement in the
mean squared errors. In particular, the percentage of variation explained by the proposed
method compared to the naive model was striking: R2 increased
from 61.85% to 63.72% on the training set and from 12.24% to 65.24% on the test set.
In conclusion, our results have shown how variable screening using ridge regression can
be a very useful tool for handling high-dimensional random forests. The relevance of this
work serves as a foundation for further research on variable screening at the tree and split
levels of random forest implementation.
6.2 Future Work
One limitation of our proposed method arises when there is a nonlinear association between
the predictors and the response or target variable: its performance is compromised in the
presence of nonlinear and interaction effects among variables. We considered one ad hoc
remedy, which is to include spline basis terms for each variable with the same number of
degrees of freedom (v), as in fitting a Generalized Additive Model (GAM), although the
screening model remains linear in the basis terms. Here the degrees of freedom v can be
treated as a tuning parameter. Due to time constraints, this limitation will be addressed in future work.
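The ad hoc spline expansion mentioned above can be sketched as follows, using the natural-spline basis ns() from the splines package that ships with R. This is only a sketch of the idea (the expanded design matrix would feed into the ridge screening step), not something implemented in the thesis; the df value v plays the role of the tuning parameter:

```r
library(splines)

set.seed(1)
x <- matrix(rnorm(200 * 3), 200, 3)   # three hypothetical predictors
v <- 4                                # degrees of freedom per variable
# expand each column into a natural-spline basis with v columns
X.spline <- do.call(cbind, lapply(1:ncol(x), function(j) ns(x[, j], df = v)))
dim(X.spline)   # 200 rows and 3 * v = 12 basis columns
```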
Moreover, in this research we considered variable screening at the forest level, which is
a preliminary step toward the future studies outlined below. Future work will seek to
implement tree-level and split-level variable screening in HD random forests. We plan to
continue developing this research in the following directions:
1. (Tree-Level): Repeat the ridge-estimator screening with each (or a few) bootstrap
samples in the random forest and combine the results via ensemble learning (e.g.,
using the rpart package in R).

2. (Split-Level): Apply variable screening/selection/weighting at each split. The
selection can be made over variables or splits, and for nonlinear cases we can include
spline terms.
In conclusion, these further studies would go a long way toward implementing a more robust
random forest model that can better handle high-dimensional data.
References
Amaratunga, D., Cabrera, J., and Lee, Y.-S. (2008). Enriched random forests. Bioinfor-
matics, 24(18):2010–2014.
Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized
trees. Neural computation, 9(7):1545–1588.
Berdugo, E., Bustos, D., Gonzalez, J., Palacio, A., Perez, J., and Rocha, E. (2020). Iden-
tification and comparison of factors associated with performance in the saber pro and
saber tyt exams for the period 2016-2019.
Bernard, S., Heutte, L., and Adam, S. (2010). A study of strength and correlation in
random forests. In International Conference on Intelligent Computing, pages 186–191.
Springer.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning
Research, 13(1):1063–1095.
Breiman, L. (1996a). Bagging predictors. Machine learning, 24(2):123–140.
Breiman, L. (1996b). Out-of-bag estimation.
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.
1. Statistics Department University of California Berkeley, CA, USA, 1:58.
Capitaine, L., Genuer, R., and Thiebaut, R. (2020). Random forests for high-dimensional
longitudinal data. Statistical Methods in Medical Research, page 0962280220946080.
Darst, B. F., Malecki, K. C., and Engelman, C. D. (2018). Using recursive feature elim-
ination in random forest to account for correlated variables in high dimensional data.
BMC genetics, 19(1):1–6.
Denil, M., Matheson, D., and De Freitas, N. (2014). Narrowing the gap: Random forests in
theory and in practice. In International conference on machine learning, pages 665–673.
PMLR.
Dietterich, T. G. (1998). An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting and randomization. Machine learning,
32:1–22.
Dietterich, T. G. (2000). An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting, and randomization. Machine learning,
40(2):139–157.
Do, T.-N., Lallich, S., Pham, N.-K., and Lenca, P. (2009). Un nouvel algorithme de forêts
aléatoires d'arbres obliques particulièrement adapté à la classification de données en
grandes dimensions. In EGC, pages 79–90.
Do, T.-N., Lenca, P., Lallich, S., and Pham, N.-K. (2010). Classifying very-high-
dimensional data with random forests of oblique decision trees. In Advances in knowledge
discovery and management, pages 39–55. Springer.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics,
pages 1–67.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2017). Correlation and variable importance
in random forests. Statistics and Computing, 27(3):659–678.
Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd international conference
on document analysis and recognition, volume 1, pages 278–282. IEEE.
Kulkarni, V. Y. and Sinha, P. K. (2012). Pruning of random forest classifiers: A survey
and future directions. In 2012 International Conference on Data Science & Engineering
(ICDSE), pages 64–68. IEEE.
Liaw, A., Wiener, M., et al. (2002). Classification and regression by randomforest. R news,
2(3):18–22.
Louppe, G. (2014). Understanding random forests: From theory to practice. arXiv preprint
arXiv:1407.7502.
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable
importances in forests of randomized trees. Advances in neural information processing
systems 26.
RColorBrewer, S. and Liaw, M. A. (2018). Package ‘randomforest’. University of Califor-
nia, Berkeley: Berkeley, CA, USA.
Schwarz, D. F., König, I. R., and Ziegler, A. (2010). On safari to random jungle: a fast
implementation of random forests for high-dimensional data. Bioinformatics, 26(14):1752–
1758.
Seligman, M. (2015). Rborist: extensible, parallelizable implementation of the random
forest algorithm. R Package Version, 1(3):1–15.
Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences, 100(16):9440–9445.
Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest
variable importance measures: Illustrations, sources and a solution. BMC bioinformatics,
8(1):1–21.
Te Beest, D. E., Mes, S. W., Wilting, S. M., Brakenhoff, R. H., and van de Wiel, M. A.
(2017). Improved high-dimensional prediction with random forests by the use of co-data.
BMC bioinformatics, 18(1):1–11.
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules. Citeseer.
Wolpert, D. H. and Macready, W. G. (1999). An efficient method to estimate bagging’s
generalization error. Machine Learning, 35(1):41–55.
Wright, M. N. and Ziegler, A. (2015). ranger: A fast implementation of random forests for
high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409.
Xu, B., Huang, J. Z., Williams, G., Wang, Q., and Ye, Y. (2012). Classifying very high-
dimensional data with random forests built from small subspaces. International Journal
of Data Warehousing and Mining (IJDWM), 8(2):44–63.
Zhang, J. and Zulkernine, M. (2006). A hybrid network intrusion detection technique
using random forests. In First International Conference on Availability, Reliability and
Security (ARES’06), pages 8–pp. IEEE.
Appendix
R Codes
1. Function for Simulating High-Dimensional Data
# ###########################################
# FUNCTIONS FOR HD RF
# ###########################################

require(MASS)   # TO GENERATE MULTIVARIATE NORMAL DATA

rdat <- function(n=100, p0=10, beta0, sigma=1, rho=0.5, s2=1,
                 model=c("linear", "MARS1", "MARS2", "Tree"))
{
  p <- length(beta0) - 1   # p = NUMBER OF (INFORMATIVE) PREDICTORS
  mu <- rep(0, p)
  Sigma <- s2*(matrix(rho, nrow=p, ncol=p) + diag(rep(1-rho, p)))
  X <- mvrnorm(n=n, mu=mu, Sigma=Sigma)

  if (NCOL(X) < 5) stop("Please give me at least 5 predictors first.")
  eps <- rnorm(n, mean=0, sd=sigma)
  if (model == "linear") {
    ymean <- cbind(1, X) %*% beta0
  }
  else if (model == "MARS1") {
    ymean <- 0.1*exp(4*X[,1]) + 4/(1 + exp(-20*(X[,2] - 0.5))) +
      3*X[,3] + 2*X[,4] + X[,5]
  }
  else if (model == "MARS2") {
    ymean <- 10*sin(pi*X[,1]*X[,2]) + 20*((X[,3] - 0.5)^2) +
      10*X[,4] + 5*X[,5]
  }
  else {
    ymean <- 5*I(X[,1] <= 1)*I(X[,2] <= 0.2) +
      4*I(X[,3] <= 1)*I(X[,4] <= -0.2) + 3*I(X[,5] <= 0)
  }

  X0 <- matrix(rnorm(n*p0), nrow=n, ncol=p0)   # NOISE PREDICTORS
  y <- ymean + eps
  dat <- data.frame(cbind(y, X, X0))
  names(dat) <- c("y", paste("x", 1:(p+p0), sep=""))
  return(dat)
}
2. Illustration of HD Problems
####################################################################
# SIMULATION STUDIES DESIGNED TO ILLUSTRATE PROBLEMS IN RF WITH HD
####################################################################

# =================
# SIMULATE DATA
# =================
## Libraries Required
library(MASS)
library(tidyverse)
library(tibble)
library(gridExtra)
library(randomForest)
library(dplyr)
library(ggplot2)
source("Functions-HDRF.R")

set.seed(125)
beta0 <- c(0.5, 0.1, -0.4, 0.5, 1, 3)
n <- 500; n.test <- 5000
sigma <- 1; rho <- 0; s2 <- 1
p0 <- 500   # NUMBER OF NOISE PREDICTORS
mod <- "linear"; xcols.true <- 1:5
nrun <- 20
dat.test <- rdat(n=n.test, p0=p0, beta0=beta0, sigma=sigma, rho=rho,
                 s2=s2, model=mod)
y.test <- dat.test$y
X.test <- dat.test[, -1]
ntree <- 500   # mtry <- 2
MSE <- NULL
for (i in 1:nrun){
  print(cbind(p0=p0, rho=rho, run=i))
  dat <- rdat(n=n, p0=p0, beta0=beta0, sigma=sigma,
              rho=rho, s2=s2, model=mod)

  ### TRUE VARIABLES (Oracle Method)
  fit0.RF <- randomForest(y ~ x1 + x2 + x3 + x4 + x5,
                          data=dat, xtest=X.test[, xcols.true],
                          ytest=y.test, ntree=ntree, importance=FALSE)

  ### ALL VARIABLES (Naive Method)
  fit.RF <- randomForest(y ~ ., data=dat, xtest=X.test,
                         ytest=y.test, ntree=ntree, importance=FALSE)

  ### Extracting MSEs
  Mse <- cbind(run=i, n.trees=1:ntree, mse0=fit0.RF$mse,
               mse0.test=fit0.RF$test$mse,
               mse=fit.RF$mse, mse.test=fit.RF$test$mse)
  MSE <- rbind(MSE, Mse)
}

### Renaming Columns of the MSE Table
MSE <- as_tibble(MSE)
names(MSE) <- c("run", "n.trees", "mse0", "mse0.test",
                "mse", "mse.test")
MSE

# ==========================
# EXPLORE THE RESULTS
# ==========================
MSE %>%
  group_by(n.trees) %>%
  summarise(run=1, amse0=mean(mse0), amse0.test=mean(mse0.test),
            amse=mean(mse), amse.test=mean(mse.test)) -> aMSE
aMSE

### MSE PLOTS WITH LEGENDS
colors.le <- c("mse0"="skyblue", "mse"="coral", "amse0"="blue",
               "amse"="red")
colors.lee <- c("mse0.test"="skyblue", "mse.test"="coral",
                "amse0.test"="blue", "amse.test"="red")

# TRAINING DATA
fig.1 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse0, color="mse0"), size=0.2) +
  geom_line(aes(x=n.trees, y=mse, color="mse"), size=0.2) +
  geom_line(data=aMSE, aes(y=amse0, color="amse0"), size=1) +
  geom_line(data=aMSE, aes(y=amse, color="amse"), size=1) +
  xlab("number of trees") + ylab("MSE") +
  labs(title="(a) Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.6, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# TEST DATA
fig.2 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse0.test, color="mse0.test"), size=0.2) +
  geom_line(aes(x=n.trees, y=mse.test, color="mse.test"), size=0.2) +
  geom_line(data=aMSE, aes(y=amse0.test, color="amse0.test"), size=1) +
  geom_line(data=aMSE, aes(y=amse.test, color="amse.test"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) MSE with Test Data") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.6, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")

grid.arrange(fig.1, fig.2, ncol=2, nrow=1)
3. Simulation Studies on the Proposed Method
##########################################################
## SIMULATION STUDIES DESIGNED FOR THE PROPOSED METHOD
##########################################################
source("Functions-HDRF.R")

set.seed(125)
beta0 <- c(0.5, 0.1, -0.4, 0.5, 1, 3)
n <- 500; n.test <- 5000
sigma <- 1; rho <- 0.5; s2 <- 1
p0 <- 500   # NUMBER OF NOISE PREDICTORS
mod <- "linear"; xcols.true <- 1:5
nrun <- 100
dat <- rdat(n=n, p0=p0, beta0=beta0, sigma=sigma,
            rho=rho, s2=s2, model=mod)
dat.test <- rdat(n=n.test, p0=p0, beta0=beta0,
                 sigma=sigma, rho=rho,
                 s2=s2, model=mod)
y.test <- dat.test$y
X.test <- dat.test[, -1]
ntree <- 500   # mtry <- 2
lambda <- 0.01   # Lambda value for ridge regression
MSE <- NULL

for (i in 1:nrun){
  print(cbind(p0=p0, rho=rho, run=i))

  ### TRUE VARIABLES (Oracle Method)
  fit0.RF <- randomForest(y ~ x1 + x2 + x3 + x4 + x5,
                          data=dat, xtest=X.test[, xcols.true],
                          ytest=y.test, ntree=ntree, importance=FALSE)

  ### ALL VARIABLES (Naive Method)
  fit.RF <- randomForest(y ~ ., data=dat, xtest=X.test,
                         ytest=y.test, ntree=ntree, importance=FALSE)

  ### Variable Screening With Ridge Regression
  # Stage 1: keep the top 30% of variables ranked by |ridge coefficient|
  fit.ridge1 <- lm.ridge(y ~ ., data=dat, lambda=lambda)
  beta.RR1 <- fit.ridge1$coef
  a1 <- 0.30
  pa1 <- length(beta.RR1)
  n.select1 <- trunc(pa1*a1)
  Xs.selected1 <- names(sort(abs(beta.RR1), decreasing=TRUE)[1:n.select1])
  newdat1 <- cbind(y=dat$y, dat[, Xs.selected1])
  newX.test1 <- X.test[Xs.selected1]

  # Stage 2: keep the top 50% of the remaining variables
  # (so 15% of the original variables survive overall)
  lambda2 <- 0.01
  fit.ridge2 <- lm.ridge(y ~ ., data=newdat1, lambda=lambda2)
  beta.RR2 <- fit.ridge2$coef
  a2 <- 0.50
  pa2 <- length(beta.RR2)
  n.select2 <- trunc(pa2*a2)
  Xs.selected2 <- names(sort(abs(beta.RR2), decreasing=TRUE)[1:n.select2])
  newdat2 <- cbind(y=dat$y, newdat1[, Xs.selected2])
  newX.test2 <- newX.test1[Xs.selected2]

  ### PROPOSED METHOD
  fitp.RF <- randomForest(y ~ ., data=newdat2, xtest=newX.test2,
                          ytest=y.test, ntree=ntree, importance=TRUE)

  ## Mean Squared Errors
  # mse0 = oracle, mse1 = naive, mse2 = proposed (matching the plots below)
  Mse <- cbind(run=i, n.trees=1:ntree, mse0=fit0.RF$mse,
               mse0.test=fit0.RF$test$mse, mse1=fit.RF$mse,
               mse1.test=fit.RF$test$mse, mse2=fitp.RF$mse,
               mse2.test=fitp.RF$test$mse)
  MSE <- rbind(MSE, Mse)
}

MSE <- as_tibble(MSE)
names(MSE) <- c("run", "n.trees", "mse0", "mse0.test", "mse1",
                "mse1.test", "mse2", "mse2.test")
MSE

# ====================
# EXPLORE THE RESULTS
# ====================
MSE %>%
  group_by(n.trees) %>%
  summarise(run=1, amse1=mean(mse1), amse1.test=mean(mse1.test),
            amse2=mean(mse2), amse2.test=mean(mse2.test)) -> aMSE
aMSE

##################################################
######### PLOTS INCLUDING PROPOSED METHOD ########
##################################################

colors.le <- c("mse1" = "green4", "mse2" = "blue")
colors.lee <- c("amse1" = "green4", "amse2" = "blue")

# TRAINING MSE
fig1 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse1, color="mse1"), size=0.2) +
  geom_line(aes(y=mse2, color="mse2"), size=0.2) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# AVERAGE TRAINING MSE
fig11 <- ggplot(data=aMSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=amse1, color="amse1"), size=1) +
  geom_line(aes(y=amse2, color="amse2"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) Average Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")

grid.arrange(fig1, fig11, ncol=2, nrow=1)

###########################################################

colors.le.test <- c("mse1.test" = "green4", "mse2.test" = "blue")
colors.lee.test <- c("amse1.test" = "green4", "amse2.test" = "blue")

## TEST MSE
fig2 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse1.test, color="mse1.test"), size=0.2) +
  geom_line(aes(y=mse2.test, color="mse2.test"), size=0.2) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) Test MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le.test) +
  labs(colour="Models")

## AVERAGE TEST MSE
fig22 <- ggplot(data=aMSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=amse1.test, color="amse1.test"), size=1) +
  geom_line(aes(y=amse2.test, color="amse2.test"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) Average Test MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee.test) +
  labs(colour="Models")

grid.arrange(fig2, fig22, ncol=2, nrow=1)
4. Real Data Exploration
####################################################################
################ REAL DATA EXPLORATION #############################
####################################################################

## Libraries Required
library(MASS)           # lm.ridge
library(dplyr)          # select
library(mice)           # multiple imputation
library(caTools)        # sample.split
library(randomForest)
library(ggplot2)
library(gridExtra)

# ====================================
# Communities and Crime Data Set
# ====================================
dat1 <- read.csv(file="communities.csv", header=TRUE, na="?")
head(dat1)
dim(dat1)
colSums(is.na(dat1))

## Removing the 5 non-predictive variables (state, county, community,
## communityname, and fold) together with PolicPerPop
dat1 <- select(dat1, -c(state, county, community,
                        communityname, fold, PolicPerPop))
dim(dat1)

## CHECKING FOR MISSING VALUES
miss.info <- function(dat, filename=NULL){
  vnames <- colnames(dat)
  n <- nrow(dat)
  out <- NULL
  for (j in 1:ncol(dat)){
    vname <- colnames(dat)[j]
    x <- as.vector(dat[, j])
    n1 <- sum(is.na(x), na.rm=T)
    n2 <- sum(x=="NA", na.rm=T)
    n3 <- sum(x=="", na.rm=T)
    nmiss <- n1 + n2 + n3
    ncomplete <- n - nmiss
    out <- rbind(out, c(col.number=j, vname=vname,
                        mode=mode(x), n.levels=length(unique(x)),
                        ncomplete=ncomplete, miss.perc=nmiss/n))
  }
  out <- as.data.frame(out)
  row.names(out) <- NULL
  if (!is.null(filename)) write.csv(out, file=filename, row.names=F)
  return(out)
}

miss.info(dat = dat1)

## Removing predictors with more than 50% missing values
dat <- dat1[, -c(26, 97:112, 116:119, 121)]
dim(dat)
miss.info(dat=dat)

## Data imputation for predictors with less than 50% missing values
dat_impute <- mice(dat, m=5, method="pmm", maxit=30, seed=100)
dat_imputed1 <- complete(dat_impute, 3)   # Selecting the third set of imputed values
dat1 <- as.data.frame(dat_imputed1)
colSums(is.na(dat1))
miss.info(dat = dat1)

# Generating Multivariate Normal Data (Noise Variables)
n <- nrow(dat1)
p0 <- 700
set.seed(100)
X0 <- round(matrix(rnorm(n*p0), nrow=n, ncol=p0), 2)   # Noise
dat <- data.frame(cbind(dat1, X0))   # append the noise to the imputed data
head(dat)
dim(dat)

## Data partitioning
set.seed(100)
sample <- sample.split(dat$ViolentCrimesPerPop, SplitRatio=(2/3))
training <- subset(dat, sample == TRUE)
test <- subset(dat, sample == FALSE)
dim(training)
dim(test)

#############################################
############# MODEL FITTING #################
#############################################
y.test <- test$ViolentCrimesPerPop
X.test <- test[, names(test) != "ViolentCrimesPerPop"]   # drop the response by name
ntree <- 500
lambda <- 0.1   # Lambda value for ridge regression

## Ridge regression
# Stage 1: selecting the best 30% of variables by |ridge coefficient|
fit.ridge1 <- lm.ridge(ViolentCrimesPerPop ~ ., data=training, lambda=lambda)
beta.RR1 <- fit.ridge1$coef
a1 <- 0.30
pa1 <- length(beta.RR1)
n.select1 <- trunc(pa1*a1)
Xs.selected1 <- names(sort(abs(beta.RR1), decreasing=TRUE)[1:n.select1])
newdat1 <- cbind(ViolentCrimesPerPop=training$ViolentCrimesPerPop,
                 training[, Xs.selected1])
newX.test1 <- X.test[Xs.selected1]

# Stage 2: selecting the best 50% of the remaining variables
# (15% of the original predictors survive overall)
fit.ridge2 <- lm.ridge(ViolentCrimesPerPop ~ ., data=newdat1, lambda=lambda)
beta.RR2 <- fit.ridge2$coef
a2 <- 0.50
pa2 <- length(beta.RR2)
n.select2 <- trunc(pa2*a2)
Xs.selected2 <- names(sort(abs(beta.RR2), decreasing=TRUE)[1:n.select2])
newdat2 <- cbind(ViolentCrimesPerPop=training$ViolentCrimesPerPop,
                 newdat1[, Xs.selected2])
newX.test2 <- newX.test1[Xs.selected2]
dim(newdat2)

#### FITTING THE PROPOSED RF MODEL
mtry1 <- tuneRF(newdat2[, -1], newdat2$ViolentCrimesPerPop,
                ntreeTry=100, stepFactor=2, improve=0.05, trace=TRUE,
                plot=FALSE, doBest=FALSE)
best.mtry1 <- mtry1[mtry1[, 2] == min(mtry1[, 2]), 1]
best.mtry1

fit.RF2 <- randomForest(ViolentCrimesPerPop ~ ., data=newdat2,
                        xtest=newX.test2, ytest=y.test, mtry=best.mtry1,
                        ntree=ntree, importance=TRUE)
print(fit.RF2)
varImpPlot(fit.RF2)

#### FITTING THE NAIVE RF MODEL
mtry <- tuneRF(training[, names(training) != "ViolentCrimesPerPop"],
               training$ViolentCrimesPerPop,
               ntreeTry=100, stepFactor=2, improve=0.05, trace=TRUE,
               plot=FALSE, doBest=FALSE)
best.mtry <- mtry[mtry[, 2] == min(mtry[, 2]), 1]

fit.RF1 <- randomForest(ViolentCrimesPerPop ~ ., data=training,
                        xtest=X.test, ytest=y.test, mtry=best.mtry,
                        ntree=ntree, importance=TRUE)
print(fit.RF1)
varImpPlot(fit.RF1)

#=======================================
#============== MSE PLOTS ==============
#=======================================
## Collecting the per-tree MSE paths (fit.RF1 = naive, fit.RF2 = proposed)
MSE <- data.frame(n.trees=1:ntree,
                  mse1.training=fit.RF1$mse, mse1.test=fit.RF1$test$mse,
                  mse2.training=fit.RF2$mse, mse2.test=fit.RF2$test$mse)

colors.le <- c("mse1.training" = "green4", "mse2.training" = "blue")
colors.lee <- c("mse1.test" = "green4", "mse2.test" = "blue")

# TRAINING MSE
fig1 <- ggplot(data=MSE, aes(x=n.trees)) +
  geom_line(aes(y=mse1.training, color="mse1.training"), size=0.8) +
  geom_line(aes(y=mse2.training, color="mse2.training"), size=0.8) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) MSE For Training Set") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.70, 0.60),
        legend.key.size = unit(1.0, "cm"),
        legend.text = element_text(colour="black", size=10)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# TEST MSE
fig2 <- ggplot(data=MSE, aes(x=n.trees)) +
  geom_line(aes(y=mse1.test, color="mse1.test"), size=0.8) +
  geom_line(aes(y=mse2.test, color="mse2.test"), size=0.8) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) MSE For Test Set") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.70, 0.80),
        legend.key.size = unit(1.0, "cm"),
        legend.text = element_text(colour="black", size=10)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")
grid.arrange(fig1, fig2, ncol=2, nrow=1)

### Out-of-Bag Error Plots for the Two Models
plot(fit.RF2, main="OOB Estimate Errors For Naive and Proposed RF Models",
     col="blue", type="l", lwd=1.8)
plot(fit.RF1, add=TRUE, col="green4", type="l", lwd=1.8)
legend("topright", legend=c("Naive Model", "Proposed Model"),
       col=c("green4", "blue"), lty=1, cex=1.2, lwd=2)
Curriculum Vitae
Roland Fiagbe was born on September 15, 1993, the first child of the late Mr. John Kwaku
Fiagbe and Janet Karikari. He is a past student of Osei Kyeretwie Senior High School in
Kumasi, Ghana, from which he graduated in 2013. In 2014, he entered Kwame Nkrumah
University of Science and Technology where he obtained his Bachelor’s degree in Statistics
with first class honors in 2018. During Roland’s time in school he served as a financial
committee member for the Association of Mathematics and Statistics Students (AMSS) in
2015-2016, treasurer for the Association of Mathematics and Statistics Students (AMSS)
in 2016-2017, and treasurer for the Ghana National Association of Mathematics Students
(NAMS-GH) in 2018-2019. After completing his Bachelor's degree, he worked at the
Electoral Commission of Ghana, Ashanti regional office in Kumasi, from 2018 to 2019. At
the Commission, Roland was actively involved in data analyses and administrative duties.
On the strength of his outstanding academic performance, and desire to pursue higher
education, he entered into a graduate program at The University of Texas at El Paso in
Fall 2019 to pursue a master’s degree in Statistics. While pursuing his master’s degree, he
worked as a Graduate Teaching Assistant (GTA) at the Mathematical Sciences Depart-
ment where he assisted professors and worked at the Math Resource Center for Students
(MaRCS) of the university to help students from all disciplines and programs.
Roland worked under the supervision of Professor Xiaogang Su on a thesis titled "High-
Dimensional Random Forests". After obtaining his master's degree in Spring 2021, Roland
will pursue his doctoral degree in Big Data Analytics at the University of Central Florida
in Orlando, Florida, starting Fall 2021.
Email address: [email protected]