High-Dimensional Random Forests
University of Texas at El Paso
ScholarWorks@UTEP
Open Access Theses & Dissertations
2021-05-01
High-Dimensional Random Forests
Roland Fiagbe University of Texas at El Paso
Follow this and additional works at: https://scholarworks.utep.edu/open_etd
Part of the Statistics and Probability Commons
Recommended Citation
Fiagbe, Roland, "High-Dimensional Random Forests" (2021). Open Access Theses & Dissertations. 3252. https://scholarworks.utep.edu/open_etd/3252
This is brought to you for free and open access by ScholarWorks@UTEP. It has been accepted for inclusion in Open Access Theses & Dissertations by an authorized administrator of ScholarWorks@UTEP. For more information, please contact [email protected].
HIGH-DIMENSIONAL RANDOM FORESTS
ROLAND FIAGBE
Master’s Program in Statistics
APPROVED:
Xiaogang Su, Ph.D., Chair
Wen-Yee Lee, Ph.D.
Naijun Sha, Ph.D.
Stephen L. Crites Jr., Ph.D.
Dean of the Graduate School
© Copyright
Roland Fiagbe
2021
Dedicated to my
Late FATHER John K. Fiagbe, MOTHER Janet Karikari, SIBLINGS Marcel and Adwoa, and my ADVISOR Professor Xiaogang Su
with love
HIGH-DIMENSIONAL RANDOM FORESTS
by
ROLAND FIAGBE
THESIS
Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE
Department of Mathematical Sciences
THE UNIVERSITY OF TEXAS AT EL PASO
May 2021
Acknowledgements
First and foremost, it is a genuine pleasure to express my heartfelt gratitude to my research supervisor, Professor Xiaogang Su of the Mathematical Sciences Department at the University of Texas at El Paso, for his constant guidance, support, encouragement and enduring patience. In spite of his busy schedule, he was always readily available to provide clear explanations of concepts whenever I did not seem to make headway. Without his dedication and persistent help, this thesis would not have been possible.
Also, I would like to thank my committee members, Professor Naijun Sha of the Mathematical Sciences Department and Professor Wen-Yee Lee of the Chemistry Department, both at the University of Texas at El Paso, for their comments, suggestions, additional support and guidance, which were very valuable to the completion of this work.
Last but not least, I am extremely thankful to my family, colleagues and friends for their support and inspiration, and to everyone who played a role in my academic accomplishments.
Abstract
The significant advances in technology have enabled easy collection and management of high-dimensional data in many fields; however, modeling these data poses a major challenge in the field of data science. Dealing with high-dimensional data is one of the significant challenges that degrade the performance and precision of most classification and regression algorithms, e.g., random forests. Random Forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance and precision, like those of other methods, are highly affected by high dimensions, especially when the dataset contains a huge number of noise or noninformative features. It is known in the literature that, for data dominated by uninformative features, the expected number of informative variables among those sampled at each split is small, which makes it difficult to obtain an accurate or robust random forest model.
In this study, we present a new algorithm that incorporates ridge regression as a variable screening tool to discern informative features in the setting of high dimensions, and then applies the classical random forest to a top portion of the selected important features. Simulation studies in high dimensions are carried out to test how the proposed method addresses the above problem and improves the performance of random forest models. To illustrate our method, we applied it to a real-life dataset (the Communities and Crime Dataset), sourced from the UCI database. The results show that variable screening with ridge regression can be a very useful tool for building high-dimensional random forests.
Table of Contents
Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problems in High-Dimensional Random Forests . . . . . . . . . . . . . . . 3
1.2 Overview of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Review of Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 History of Random Forests (RF) . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Definition of Random Forest Model . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Characterizing the Accuracy of Random Forests . . . . . . . . . . . . . . . 8
2.3.1 Convergence of Random Forests . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Strength and Correlation of Random Forests . . . . . . . . . . . . . 8
2.4 Out-of-Bag Sample and Out-of-Bag Error . . . . . . . . . . . . . . . . . . . 11
2.5 Mean Squared Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Problems in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7.1 Simulation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 High-Dimensional Random Forests with Ridge Regression Based Variable Screening 23
3.1 Ridge Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Ridge Penalty (λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Variable Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Simulation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Comparing Performance of the Proposed Model to the Existing Model . . . 30
4.2.1 Simulation Results from Linear Model . . . . . . . . . . . . . . . . 32
4.2.2 Simulation Results from Tree Model . . . . . . . . . . . . . . . . . 35
5 Real Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 Communities and Crime Dataset . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.1 Presentation of Results from Real Data Application . . . . . . . . . 40
6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Appendix
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
List of Tables
4.1 Average Prediction MSE Values for linear model 2.20 with ρ = 0 . . . . . . 32
4.2 Average Prediction MSE Values for linear model 2.20 with ρ = 0.5 . . . . . 33
4.3 Average Prediction MSE Values for tree model 2.22 with ρ = 0 . . . . . . . 35
4.4 Average Prediction MSE Values for tree model 2.22 with ρ = 0.5 . . . . . . 36
5.1 Results of analysis of Communities and Crime Dataset. Comparing the
proposed method to the existing (naive) random forest method in dealing
with high dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Prediction MSE values of training and test set for the proposed method and
existing (naive) random forest method. . . . . . . . . . . . . . . . . . . . . 40
List of Figures
2.1 Prediction MSE for linear model when ρ = 0 . . . . . . . . . . . . . . . . . 17
2.2 Prediction MSE for linear model when ρ = 0.5 . . . . . . . . . . . . . . . . 17
2.3 Prediction MSE for MARS model when ρ = 0 . . . . . . . . . . . . . . . . 18
2.4 Prediction MSE for MARS model when ρ = 0.5 . . . . . . . . . . . . . . . 18
2.5 Prediction MSE for tree model when ρ = 0 . . . . . . . . . . . . . . . . . . 19
2.6 Prediction MSE for tree model when ρ = 0.5 . . . . . . . . . . . . . . . . . 19
4.1 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.7 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
training dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.8 Prediction MSE and AMSE for the simulation runs of the three models:
Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for
test dataset with ρ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Out-of-Bag Estimate Errors of both models . . . . . . . . . . . . . . . . . 41
5.2 Comparing Prediction MSE for Training and Test Sets . . . . . . . . . . . 41
Chapter 1
Introduction
One of the current significant challenges in predictive modelling is high-dimensional data. Such data are usually found in areas like ecology, medicine, and biology, where DNA microarray technology produces a large number of measurements, as well as in finance and in industries that generate text data. Two special characteristics of high-dimensional data contribute significantly to the low performance of classification and regression algorithms. One characteristic is that the dataset usually contains different classes of variables that do not relate to the response variable; for instance, in text data, some features would relate to sports while others would relate to music. The other characteristic is that, with high-dimensional data, there is a high chance that a large number of variables are uninformative to the class feature. Uninformative variables are weakly correlated with the class feature and therefore have low power in predicting the response feature.
Random Forest was first introduced by Breiman (2001) and is known as a nonparametric statistical learning method from the family of ensemble decision tree approaches. Random Forest has been described in recent studies as one of the state-of-the-art machine learning methods for prediction and classification, and it has been very successful at these tasks (Capitaine et al., 2020). In short, it refers to a technique that builds numerous decision trees. In competition with modern and advanced methods such as Support Vector Machines (SVM) and boosting, random forest models have emerged as fast and easy to implement, with highly accurate predictions, and able to handle a very large number of variables without overfitting (Biau, 2012). Random Forest employs bagging and randomness when building each tree, which results in an uncorrelated forest of trees whose aggregate prediction is more accurate than that of any individual tree.
Based on bagging and random feature selection, a number of decision trees are generated, and their predictions are combined by majority voting for classification (Kulkarni and Sinha, 2012) or by averaging the predictions of the tree models for regression (Denil et al., 2014). Each tree in the forest depends on the values of data randomly sampled with replacement, known as a bootstrap sample. For each tree grown on a bootstrap sample, the error rate on the observations left out of that bootstrap sample is monitored; this is called the out-of-bag (OOB) error rate. Approximately one-third of the training observations are left out of each bootstrap sample (the out-of-bag sample) and are used to compute an unbiased estimate of the classification error as more trees are added to the forest. The algorithm therefore needs no further cross-validation to estimate the test or generalization error (Zhang and Zulkernine, 2006).
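The roughly one-third out-of-bag share can be checked empirically. The following sketch (an illustration of ours, not code from the thesis) draws one bootstrap sample and counts the rows it leaves out:

```python
import random

def oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and
    return the fraction of the n rows that were never selected."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}  # distinct sampled rows
    return 1 - len(in_bag) / n

# For large n, the out-of-bag share settles near 1 - 1/e, about 0.368.
print(round(oob_fraction(100_000), 3))
```

The limiting value 1 − 1/e is derived formally in Section 2.4.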
Since its introduction, random forests, initially known as a single algorithm, have developed into an entire framework of models (Denil et al., 2014) and have seen significant applications in various fields of science, technology and the humanities. An early precursor of Random Forest, known as bagging, was developed by Breiman (1996a). In that initial approach, a random selection without replacement is made from the training set to grow each tree. Another approach is the random split selection developed by Dietterich (1998), where at each node the split is selected at random from among the permissible splits. Breiman's initial bagging approach (Breiman, 1996a) was effective in reducing the variance of regression predictors, but its major limitations were leaving the bias unchanged and high correlation among trees. These concerns motivated the development of Random Forest. According to Xu et al. (2012), among the numerous methods proposed to build random forest models from subspaces, the method proposed by Breiman (2001) has been the mainstream and the most applied, owing to its higher performance compared to other methods.
Although the latest and significant developments in technology enable the collection and management of high-dimensional data in many fields, analyzing and handling these data remains problematic in the field of data science. Performing analysis and prediction with high-dimensional data has always been challenging owing to the large number of predictor variables relative to the sample size. Random Forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance, like that of other methods, is affected by high dimension, especially when the dataset contains a huge number of noise predictors and complex correlations among variables. Moreover, the presence of correlated variables in a high-dimensional dataset affects the ability of the random forest model to find the true predictors, which decreases their estimated importance scores (Gregorutti et al., 2017). Recently, the complexities and analytical challenges associated with high-dimensional data in random forest models have been tackled by numerous researchers. In this project, we develop a novel feature screening method for fitting a random forest model: we propose using ridge regression as a variable screening method and applying the classical random forest to a top portion of the important variables.
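The screening step of the proposed pipeline amounts to fitting a ridge regression, ranking the features by the magnitude of their coefficients, and keeping a top portion for the forest. The fragment below is a minimal, library-free sketch of that idea on toy data of our own choosing (the function names, the data, and the penalty λ = 1 are illustrative assumptions, not the thesis's implementation; Chapter 3 develops the actual method):

```python
def ridge_coefficients(X, y, lam):
    """Solve the ridge normal equations (X'X + lam*I) beta = X'y
    by Gaussian elimination with partial pivoting."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for c in range(p):                       # forward elimination
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for j in range(c, p):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    beta = [0.0] * p                         # back substitution
    for c in reversed(range(p)):
        beta[c] = (b[c] - sum(A[c][j] * beta[j]
                              for j in range(c + 1, p))) / A[c][c]
    return beta

def screen_top(X, y, lam, keep):
    """Rank features by |ridge coefficient| and return the top `keep` indices."""
    beta = ridge_coefficients(X, y, lam)
    return sorted(range(len(beta)), key=lambda j: -abs(beta[j]))[:keep]

# Toy data: y depends on feature 0 only; features 1 and 2 are noise.
X = [[1.0, 0.3, -0.2], [2.0, -0.1, 0.4], [3.0, 0.2, 0.1],
     [4.0, -0.3, -0.4], [5.0, 0.1, 0.3]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
print(screen_top(X, y, lam=1.0, keep=1))  # feature 0 should top the ranking
```

A working version would standardize the predictors and tune λ; here the point is only the ranking-by-coefficient mechanics of the screening step.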
1.1 Problems in High-Dimensional Random Forests
From past studies, random forest is known not to perform at its best on certain data characteristics, particularly very high-dimensional data. Breiman's method works well on data of moderate dimensions, but its performance drops when dealing with very high-dimensional data (Xu et al., 2012). One major step in building a random forest model is the random selection of features when splitting the data (bootstrap samples) to grow the classifiers (decision trees). However, this step is not well suited to high-dimensional data, because such data usually contain many variables that are uninformative about the target variable. In a general random forest model, there is no criterion for selecting informative variables: all variables are sampled by uniform random sampling. As a result, the random selection of splitting variables at each split may contain few or no informative variables, especially when the percentage of truly informative variables is small. Uninformative variables (also termed noise variables) are weakly correlated with the target variable and therefore have low predictive power with respect to it. Consequently, weak trees are built in the forest, and the classification performance of the random forest model declines significantly (Amaratunga et al., 2008). Xu et al. (2012) stated that the subspace size used to build each tree has to be enlarged to increase the chance of selecting truly informative features. This would produce less noisy subspaces and improve the strength of each tree in the forest. However, this approach involves numerous computations and is likely to increase the correlation among the resulting trees in the forest. Highly correlated trees also significantly reduce the performance of random forest models (Breiman, 2001).
1.2 Overview of Study
The overview of this project is as follows. Chapter 2 highlights the history of the random forest model, some theoretical background, and related works on high-dimensional random forests. In Chapter 3, we provide detailed theoretical work on the proposed method, namely applying the ridge estimator as a variable screening tool and fitting a general random forest model to a top portion of the important features. Chapter 4 presents a simulation study illustrating the problems associated with high-dimensional data and a simulation study of the proposed method. The proposed method is applied to a real-life dataset and the results are presented in Chapter 5. Finally, Chapter 6 presents some discussion, conclusions and further studies.
Chapter 2
Review of Literature
This chapter briefly introduces the history of random forests, along with methods, justifications and some theoretical background. The difficulty encountered by RF when facing high-dimensional data is then elaborated, and several approaches available in the literature are presented.
2.1 History of Random Forests (RF)
Random Forest (RF) is considered a refinement of bagged trees. Its name derives from the random decision forests initially proposed by Ho (1995). Breiman's (1996a) bagging idea and Ho's (1995) random subspace method were combined to construct a collection of decision trees with controlled variation. The work of Amit and Geman (1997) greatly influenced Breiman's notion in his early development of random forests. Amit and Geman (1997) first presented the idea of growing an individual tree by searching over a random subset of the accessible decisions at the node-splitting stage. Dietterich (2000) later introduced the idea of randomized node optimization, proposing that the decision at each node be selected by a randomization process. Putting these ideas together, the method of building a forest of trees was first established by Breiman (2001). In order to create variation among the trees, the training data are projected into a randomly sampled subspace (known as a bootstrap sample) before growing each tree. His paper introduced a procedure for building a random forest of uncorrelated trees using Classification and Regression Trees (CART), randomized node optimization and bagging. Moreover, he introduced an estimate of the generalization error of a forest using out-of-bag errors, as well as several useful concomitant features including the proximity matrix, partial dependence plots, and variable importance ranking.
2.2 Definition of Random Forest Model
Random Forest is defined as a classifier made up of a collection of tree-structured classifiers $\{ r(X, \Phi_k, D_n),\ k \geq 1 \}$, where:
• $r(\cdot)$ is the regression or classification function estimate,
• $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is the training sample of i.i.d. $[0,1]^d \times \mathbb{R}$-valued random variables ($d \geq 2$), where $[0,1]^d$ is equipped with the standard Euclidean metric,
• $\Phi_1, \Phi_2, \ldots, \Phi_k$ are independent and identically distributed (i.i.d.) random vectors, and
• $X$ is an input vector, conditional on $\Phi_k$ and $D_n$.
It is a supervised learning algorithm that utilizes an ensemble technique for both classification and regression (Berdugo et al., 2020). For classification, each classifier (tree) casts a vote for the most popular class, whereas in regression the predictions of the random trees are combined to form the regression estimate, denoted as
$$ \bar{r}(X) = E_\Phi\left[ r(X, \Phi_k, D_n) \right] \qquad (2.1) $$
where $E_\Phi$ denotes the expectation with respect to the random parameter $\Phi$, conditionally on $X$ and the training sample $D_n$.
The individual random trees are constructed by the following procedure. According to Biau (2012), all tree nodes are associated with rectangular cells, such that at each step of the tree construction the collection of cells associated with the leaves of the tree forms a partition of $[0,1]^d$; the root of the tree is $[0,1]^d$ itself. At each node of the tree construction, a coordinate of $X = (X^{(1)}, X^{(2)}, \ldots, X^{(J)})$ is selected, with the probability of the $j$th feature being selected given as $p_{nj} \in (0,1)$. Once the coordinate is selected, the split is located at the midpoint of the chosen side. This procedure is repeated $[\log_2 \lambda_n]$ times, where $\lambda_n \geq 2$ is a deterministic parameter depending on $n$ that is fixed beforehand by the user. Each randomized classifier (tree) $r_n(X, \Phi)$ outputs the average of all $Y_i$ for which the corresponding vectors $X_i$ fall in the same cell of the random partition as $X$.
This is given as
$$ r_n(X, \Phi) = \frac{\sum_{i=1}^{n} Y_i \, \mathbb{1}_{[X_i \in A_n(X,\Phi)]}}{\sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]}} \, \mathbb{1}_{E_n(X,\Phi)} \qquad (2.2) $$
where $A_n(X, \Phi)$ represents the rectangular cell of the random partition containing $X$, and the event $E_n(X, \Phi)$ is defined by
$$ E_n(X, \Phi) = \left[ \sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]} \neq 0 \right] \qquad (2.3) $$
Now, to compute the random forest estimate, we take the expectation of the classifiers $r_n(X, \Phi)$ with respect to the parameter $\Phi$. Therefore, the random forest regression estimate is given as
$$ r_n(X) = E_\Phi\left[ r_n(X, \Phi) \right] = E_\Phi\left[ \frac{\sum_{i=1}^{n} Y_i \, \mathbb{1}_{[X_i \in A_n(X,\Phi)]}}{\sum_{i=1}^{n} \mathbb{1}_{[X_i \in A_n(X,\Phi)]}} \, \mathbb{1}_{E_n(X,\Phi)} \right] \qquad (2.4) $$
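In practice the expectation in (2.4) is approximated by averaging a finite number of randomized trees. The toy sketch below (our illustration, with one-dimensional data and a single random split point standing in for $\Phi$) evaluates the cell average of equation (2.2) for each randomized partition and then averages across trees:

```python
import random

def tree_estimate(x, data, split):
    """Eq. (2.2) for a one-split 'tree' on [0,1]: average the Y_i whose X_i
    fall on the same side of `split` as x; return 0 if the cell is empty."""
    cell = [yi for xi, yi in data if (xi < split) == (x < split)]
    return sum(cell) / len(cell) if cell else 0.0

def forest_estimate(x, data, K, seed=0):
    """Approximate eq. (2.4): average K randomized tree estimates,
    each built from an independently drawn random split point."""
    rng = random.Random(seed)
    return sum(tree_estimate(x, data, rng.random()) for _ in range(K)) / K

# Toy sample: low responses near x = 0.1-0.2, high responses near x = 0.8-0.9.
data = [(0.1, 1.0), (0.2, 1.0), (0.8, 3.0), (0.9, 3.0)]
est = forest_estimate(0.15, data, K=200)
print(round(est, 2))  # between the local mean (1.0) and the overall mean (2.0)
```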
Turning to some general remarks about the random forest model: at the tree-construction stage, each tree has $2^{[\log_2 \lambda_n]} \approx \lambda_n$ terminal nodes, and each leaf cell has volume $2^{-[\log_2 \lambda_n]} \approx 1/\lambda_n$. At each node of each tree, a candidate variable $X^{(j)}$ is chosen with probability $p_{nj} \in (0,1)$; since exactly one of the $J$ coordinates is selected at each node, it follows that $\sum_{j=1}^{J} p_{nj} = 1$.
2.3 Characterizing the Accuracy of Random Forests
2.3.1 Convergence of Random Forests
Breiman (2001) defined the margin function of a random forest from the ensemble of classifiers $r_1(X, \Phi), r_2(X, \Phi), \ldots, r_K(X, \Phi)$ and a training set drawn at random from the distribution of the random vector $(X, Y)$. The margin function is denoted as
$$ mg(X, Y) = \mathrm{av}_k\, I(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} \mathrm{av}_k\, I(r_k(X, \Phi) = \tau), \quad k = 1, \ldots, K \qquad (2.5) $$
where $I(\cdot)$ is an indicator function. The margin function estimates the degree to which the average number of votes at $(X, Y)$ for the correct class $Y$ surpasses the average vote in favor of any other class $\tau$. The larger the margin, the more accurate the forest.
We now define the generalization error as
$$ PE^* = P_{X,Y}\left( mg(X, Y) < 0 \right) \qquad (2.6) $$
where the probability is taken over the $(X, Y)$ space.
Theorem 2.3.1. Following the Strong Law of Large Numbers (SLLN), as the number of trees (classifiers) increases, for almost surely all sequences $\Phi_1, \Phi_2, \ldots, \Phi_k$, $PE^*$ converges to
$$ P_{X,Y}\left( P_\Phi(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) < 0 \right) \qquad (2.7) $$
This theorem demonstrates why random forest models do not overfit as the number of trees increases, but instead yield a limiting value of the generalization error $PE^*$.
2.3.2 Strength and Correlation of Random Forests
In random forests, the two most important properties are the strength and the correlation. These two parameters measure, respectively, the accuracy of the individual classifiers and the dependence between them. Breiman (2001) introduced these two properties to calculate an upper bound on the generalization error (Bernard et al., 2010).
The margin function for a random forest is given as
$$ mr(X, Y) = P_\Phi(r_k(X, \Phi) = Y) - \max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) \qquad (2.8) $$
and the strength of the set of classifiers is given as
$$ s = E_{X,Y}\, mr(X, Y) \qquad (2.9) $$
Assuming $s \geq 0$, Chebyshev's inequality gives
$$ PE^* \leq \frac{\mathrm{var}(mr)}{s^2} \qquad (2.10) $$
The variance of $mr$ is derived as follows. Let
$$ \tau(X, Y) = \arg\max_{\tau \neq Y} P_\Phi(r_k(X, \Phi) = \tau) $$
so that
$$ mr(X, Y) = P_\Phi(r_k(X, \Phi) = Y) - P_\Phi(r_k(X, \Phi) = \tau(X, Y)) = E_\Phi\left[ I(r_k(X, \Phi) = Y) - I(r_k(X, \Phi) = \tau(X, Y)) \right] $$
The raw margin function is
$$ rmg(\Phi, X, Y) = I(r_k(X, \Phi) = Y) - I(r_k(X, \Phi) = \tau(X, Y)) $$
Hence, $mr(X, Y)$ is the expectation of $rmg(\Phi, X, Y)$ with respect to $\Phi$, which implies that
$$ mr(X, Y)^2 = E_{\Phi,\Phi'}\, rmg(\Phi, X, Y)\, rmg(\Phi', X, Y) \qquad (2.11) $$
Now,
$$ \mathrm{var}(mr) = E_{\Phi,\Phi'}\left( \mathrm{cov}_{X,Y}\, rmg(\Phi, X, Y)\, rmg(\Phi', X, Y) \right) = E_{\Phi,\Phi'}\left( \rho(\Phi, \Phi')\, sd(\Phi)\, sd(\Phi') \right) $$
where $\rho(\Phi, \Phi')$ is the correlation between $rmg(\Phi, X, Y)$ and $rmg(\Phi', X, Y)$, and $sd(\Phi)$ is the standard deviation of $rmg(\Phi, X, Y)$. Let $\bar{\rho}$ represent the mean value of the correlation:
$$ \bar{\rho} = \frac{E_{\Phi,\Phi'}\left( \rho(\Phi, \Phi')\, sd(\Phi)\, sd(\Phi') \right)}{E_{\Phi,\Phi'}\left( sd(\Phi)\, sd(\Phi') \right)} $$
Therefore,
$$ \mathrm{var}(mr) = \bar{\rho}\left( E_\Phi\, sd(\Phi) \right)^2 \leq \bar{\rho}\, E_\Phi\, \mathrm{var}(\Phi) \qquad (2.12) $$
$$ E_\Phi\, \mathrm{var}(\Phi) \leq E_\Phi\left( E_{X,Y}\, rmg(\Phi, X, Y) \right)^2 - s^2 \leq 1 - s^2 \qquad (2.13) $$
Theorem 2.3.2. An upper bound on the generalization error follows from the strength $s$ of the individual classifiers and the mean correlation $\bar{\rho}$ between the classifiers with respect to the margin functions:
$$ PE^* \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} \qquad (2.14) $$
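As a numerical illustration of the bound in (2.14), with hypothetical values $s = 0.6$ and $\bar{\rho} = 0.2$ (ours, not from the thesis), the generalization error can be no worse than about 0.356; note that halving the mean correlation halves the bound, while the strength enters quadratically:

```python
def pe_star_bound(rho_bar, s):
    """Upper bound on the generalization error, eq. (2.14)."""
    return rho_bar * (1 - s ** 2) / s ** 2

print(pe_star_bound(0.2, 0.6))  # about 0.356
print(pe_star_bound(0.1, 0.6))  # halving rho_bar halves the bound
```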
In growing each tree, randomly selected inputs or combinations of inputs are used at
each node. However, in order to improve the accuracy of the forest, the randomness infused
has to minimize the correlation (ρ) among trees while maintaining the strength of the trees
or classifiers. The resulting forests would have the following desirable characteristics:
• Its precision is about as good as, and sometimes better than, that of AdaBoost.
• It is moderately robust to anomalies and noise.
• It is more computationally efficient than bagging or boosting.
• It produces applicable estimates of error, the strength of classifiers, correlation, and
variable importance.
2.4 Out-of-Bag Sample and Out-of-Bag Error
The idea of using out-of-bag estimates to estimate the generalization error was proposed by Wolpert and Macready (1999) and Tibshirani (1996). Wolpert and Macready (1999) proposed procedures for estimating a bagged predictor's generalization error in regression-type problems, whereas Tibshirani (1996) proposed estimating the generalization error of random classifiers using out-of-bag estimates of variance. Moreover, the work of Breiman (1996b) on error estimation for bagged classifiers provides empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. In the process of growing trees in the forest, a sequence of bootstrap samples $T_{B,1}, T_{B,2}, \ldots, T_{B,K}$ of the same size is randomly generated from the training set $T$. Thereafter, $K$ trees are constructed such that the $k$th tree $\varphi_k(x, T_{B,k})$ depends on the $k$th bootstrap sample.
Assume there are $N$ rows in the training set. Under random selection, the probability that a given row is not chosen in a single draw is $\frac{N-1}{N}$. Since bootstrap sampling draws $N$ rows with replacement, the probability that a given row is never chosen is $\left( \frac{N-1}{N} \right)^N$. As $N$ becomes large ($N \to \infty$),
$$ \lim_{N \to \infty} \left( \frac{N-1}{N} \right)^N = \lim_{N \to \infty} \left( 1 - \frac{1}{N} \right)^N = e^{-1} \approx 0.368 \approx 37\% $$
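The convergence to $e^{-1}$ is fast, as a quick numerical check (our illustration) confirms:

```python
import math

# (1 - 1/N)^N approaches e^{-1} (about 0.3679) as N grows.
for N in (10, 100, 1000):
    print(N, round((1 - 1 / N) ** N, 4))
```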
Therefore, approximately 37% of each bootstrap training set $T_B$ (known as the out-of-bag sample) is left out, and these observations are used to form accurate estimates of important quantities, such as improved estimates of node probabilities and of the generalization error. This generalization error estimate is referred to as the out-of-bag error. In conclusion, the out-of-bag error estimate eliminates the requirement of setting aside a test set for cross-validation.
2.5 Mean Squared Error (MSE)
As illustrated above, owing to the use of random sampling in constructing bootstrap samples, approximately 37% of the observations are not used by any individual classifier (tree); these form the out-of-bag (OOB) sample for each classifier. The OOB samples are used to estimate the prediction accuracy of the forest, computed as
$$ MSE_{OOB} = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - \bar{y}_{j,OOB} \right)^2 \qquad (2.15) $$
where $n$ is the total number of observations and $\bar{y}_{j,OOB}$ is the average of the out-of-bag predictions for the $j$th observation. Moreover, the percentage of variance explained, $R^2_{OOB}$, can be obtained as
$$ R^2_{OOB} = 1 - \frac{MSE_{OOB}}{SST} \qquad (2.16) $$
where $SST = \sum_{j=1}^{n} (y_j - \bar{y})^2$ is the total sum of squares and $\bar{y}$ denotes the overall mean of the observations.
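Given the vector of per-observation OOB averages $\bar{y}_{j,OOB}$, equations (2.15) and (2.16) are short computations; the sketch below makes the bookkeeping explicit on toy numbers of our own:

```python
def mse_oob(y, y_oob):
    """Eq. (2.15): mean squared error of the out-of-bag predictions."""
    return sum((yj - pj) ** 2 for yj, pj in zip(y, y_oob)) / len(y)

def r2_oob(y, y_oob):
    """Eq. (2.16): 1 - MSE_OOB / SST, with SST the total sum of squares."""
    ybar = sum(y) / len(y)
    sst = sum((yj - ybar) ** 2 for yj in y)
    return 1 - mse_oob(y, y_oob) / sst

y     = [1.0, 2.0, 3.0, 4.0]   # observed responses (toy values)
y_oob = [1.2, 1.8, 3.1, 4.3]   # averaged OOB predictions per observation
print(round(mse_oob(y, y_oob), 3))  # 0.045
```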
For tree $t$, the out-of-bag mean squared error is computed as
$$ MSE_{t,OOB} = \frac{1}{n_{t,OOB}} \sum_{j=1}^{n_{t,OOB}} \left( y_j - \bar{y}_{tj} \right)^2 \qquad (2.17) $$
where $n_{t,OOB}$ is the number of observations in the OOB sample of tree $t$ and $\bar{y}_{tj}$ is the average of the OOB predictions for the $j$th observation of tree $t$. Now, for $R$ runs, the average mean squared error per run can be computed as
$$ AMSE_{t,OOB} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{n_{t,OOB}} \sum_{j=1}^{n_{t,OOB}} \left( y_j - \bar{y}_{tj} \right)^2 \qquad (2.18) $$
where $R$ is the total number of runs (iterations).
2.6 Variable Importance
One of the most important objectives of machine learning, aside from finding the best and most accurate model of the response variable, is to identify the predictors that are most important and relevant for making predictions. This enables an in-depth understanding of the problem under study. Random forest models provide several methods for assessing the significance of the predictors that best explain a response variable, which subsequently improves the interpretability of the model (Louppe, 2014).
In random forest models, as well as other tree-based models, a naive measure of variable importance is simply to count the number of times each predictor is selected by the individual trees in the forest (Strobl et al., 2007). Friedman (2001) presented a more elaborate variable importance measure that incorporates a (weighted) mean of each individual tree's improvement from the splits produced by each variable. An example is the Gini importance, available in random forest implementations for measuring variable importance in classification, with average impurity reduction playing the analogous role for regression trees. Breiman (2001, 2002) proposed determining the importance of a variable $X_j$ for predicting a response variable $Y$ by adding up the weighted impurity decreases $p(t)\,\Delta i(s_t, t)$ over all nodes $t$ where $X_j$ is used, averaged over all $N_T$ trees in the forest
(Louppe et al., 2013). This is given as
$$ \mathrm{Imp}(X_j) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_j} p(t)\, \Delta i(s_t, t) \qquad (2.19) $$
where $v(s_t)$ denotes the variable used to split node $t$, and $p(t) = N_t / N$ is the proportion of samples reaching $t$. This measure is referred to as the Mean Decrease Gini or Gini importance, and more generally as the Mean Decrease Impurity (MDI) importance.
A popular variable importance measure for regression forests is the average impurity reduction. However, the permutation accuracy importance is the most advanced measure for selecting important variables in random forests. The idea of measuring the Mean Decrease Accuracy (MDA), or Mean Increase Error, of the forest to evaluate the importance of a variable X_j was also proposed by Breiman (2001, 2002). This measure is referred to as permutation importance since the values of the variable X_j are randomly permuted in the out-of-bag samples (Louppe et al., 2013).
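The permutation idea can be sketched in a few lines: shuffle one feature's values, re-score the model, and report the increase in error. The toy `predict` function below is a hypothetical stand-in for a fitted forest, used only to show the mechanics.

```python
import random
import statistics

def mse(predict, X, y):
    """Mean squared error of a prediction function on data (X, y)."""
    return statistics.mean((yi - predict(xi)) ** 2 for xi, yi in zip(X, y))

def permutation_importance(predict, X, y, j, rng):
    """Mean Decrease Accuracy (permutation importance) sketch:
    permute feature j and report the resulting increase in error."""
    base = mse(predict, X, y)
    Xp = [row[:] for row in X]            # copy so X is untouched
    col = [row[j] for row in Xp]
    rng.shuffle(col)                      # break the feature-response link
    for row, v in zip(Xp, col):
        row[j] = v
    return mse(predict, Xp, y) - base

# Toy model: y depends on feature 0 only; feature 1 is pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [2 * x0 for x0, _ in X]
predict = lambda row: 2 * row[0]          # stand-in for a fitted forest

imp_informative = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
```

Permuting the informative feature inflates the error substantially, while permuting the noise feature leaves it unchanged, which is exactly the contrast the MDA measure exploits.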
2.7 Problems in High Dimensions
As mentioned earlier, the problem of high dimension arises when the dataset contains a huge number of features of which only a small percentage are truly informative. This degrades the accuracy of the base classifiers (trees): since simple random sampling is employed at each node to select the subset of m candidate variables, non-informative features will dominate the random subset of candidate splitting variables. For instance, consider data D = \{(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}\}_{i=1}^{n} with Θ features in total, of which only ω are informative, and suppose we want to compute the expected number of informative features among the default mtry candidates used in building a forest. Let η be the number of features selected at each node of the tree by random sampling with equal weights (probabilities). Then the number of informative features selected, x, follows a binomial distribution with η trials, where the probability of selecting an informative feature is p = ω/Θ and the probability of selecting a non-informative feature is q = 1 − ω/Θ:

    x \sim \text{Binomial}(\eta, p), \qquad \Pr(x; \eta, p) = \binom{\eta}{x} p^x q^{\eta - x}

At each node, the expected number of informative features selected is μ = ηp. Since the proportion of informative features is small (p < q), the expectation μ will also be small.
For example, given Θ = 10,000 and ω = 100,

    p = \frac{\omega}{\Theta} = \frac{100}{10{,}000} = 0.01.

For regression, the default is η₁ = Θ/3 = 10,000/3 ≈ 3,333; similarly, for classification, η₂ = √Θ = √10,000 = 100, where η₁ and η₂ are the default mtry values associated with regression and classification trees respectively. In the case of classification, E(x) = η₂p = 100(0.01) = 1, and for regression, E(x) = η₁p = 3333(0.01) ≈ 33. This implies that the expected number of informative features per node among the default mtry candidate variables is 1 for classification trees and 33 for regression trees. Therefore, the resultant base classifiers will have low accuracies, which in turn degrades the overall performance of the forest.
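The expected counts above can be reproduced with a short sketch; the default mtry rules (√Θ for classification, Θ/3 for regression) follow the text, and the `expected_informative` and `binom_pmf` helpers are ours.

```python
import math

def expected_informative(total, informative, mtry):
    """Expected number of informative features among mtry candidates
    drawn uniformly at random: E[x] = mtry * (informative / total)."""
    return mtry * informative / total

def binom_pmf(x, n, p):
    """Binomial probability mass function Pr(x; n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

theta, omega = 10_000, 100              # total and informative features
p = omega / theta                       # 0.01

mtry_class = round(math.sqrt(theta))    # default for classification: sqrt(Theta)
mtry_reg = theta // 3                   # default for regression: Theta / 3

e_class = expected_informative(theta, omega, mtry_class)  # = 1
e_reg = expected_informative(theta, omega, mtry_reg)      # ~ 33.3

# Chance that a classification node sees *no* informative candidate at all:
p_none = binom_pmf(0, mtry_class, p)
```

Notably, with these defaults a classification node has roughly a one-in-three chance (0.99^100 ≈ 0.37) of drawing no informative feature whatsoever, which is why the base trees degrade.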
To better understand this effect, we performed a simulation study to illustrate the performance of the random forest model on high-dimensional data by analyzing its prediction accuracy (mean squared error). A training set D₁ of size n = 500 and a test set D₂ of size n = 5000 were generated from a linear model, a nonlinear model (the Multivariate Adaptive Regression Splines model, MARS) and a tree model.
The first 5 predictors (x₁, x₂, ..., x₅) were considered the true variables and \{x_i\}_{i=6}^{505} were noise variables with \{x_i\}_{i=6}^{505} \sim N(0, 1). For each of the simulation models, two random forest models were fitted: the oracle model and the naive model. The oracle model was fitted with only the true predictors (p = 5) and the naive model was fitted with all the predictors (p = 505), with 50 iterations. For each RF model (oracle and naive), we considered both the case of no correlation between covariates (ρ = 0) and the case of correlated covariates (ρ = 0.5).
2.7.1 Simulation Models
For the purpose of performing a simulation study on the problem, data were simulated from the three models below:

    y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon_i    (2.20)

    y_i = 10 \sin(\pi x_1 x_2) + 20 \left( x_3 - \tfrac{1}{2} \right)^2 + 10 x_4 + 5 x_5 + \varepsilon_i    (2.21)

    y_i = 5\, I(X_1 \le 1)\, I(X_2 \le 0.2) + 4\, I(X_3 \le 1)\, I(X_4 \le -0.2) + 3\, I(X_5 \le 0) + \varepsilon_i    (2.22)

where \varepsilon_i \sim N(0, 1) and I(\cdot) is an indicator function.
Model (2.20) posits a linear relationship between the true predictor variables x₁, x₂, ..., x₅ and the response, and is used to examine the performance of the RF model on high-dimensional data with an underlying linear structure. The true coefficients β₀, ..., β₅ of model (2.20) were 0.5, 0.1, −0.4, 0.5, 1, 3 respectively. Covariates were generated from the standard normal distribution, \{x_i\}_{i=6}^{505} \sim N(0, 1).
In model (2.21), the response variable Y is simulated using the Multivariate Adaptive Regression Splines (MARS) model of Friedman (1991), featuring nonlinear associations and variable interactions among predictors, to examine the performance of RF on high-dimensional data when the target variable has both linear and nonlinear additive dependence. Finally, the response of model (2.22) was simulated using a tree function, to examine the performance of the random forest model on high-dimensional data when the response is generated by a tree structure.
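A minimal data-generating sketch of models (2.20)–(2.22), assuming i.i.d. N(0, 1) covariates and errors as in the text; the `simulate` helper and its argument names are ours, not the thesis code.

```python
import math
import random

def simulate(model, n, p_noise=500, rng=None):
    """Generate (X, y) from the three simulation designs:
    'linear' (2.20), 'mars' (2.21), or 'tree' (2.22). The first five
    covariates are informative; the remaining p_noise are N(0, 1) noise.
    Linear-model coefficients follow the text:
    beta_0..beta_5 = 0.5, 0.1, -0.4, 0.5, 1, 3."""
    rng = rng or random.Random(0)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(5 + p_noise)]
        eps = rng.gauss(0, 1)
        if model == "linear":
            b = [0.5, 0.1, -0.4, 0.5, 1, 3]
            yi = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x[:5])) + eps
        elif model == "mars":
            yi = (10 * math.sin(math.pi * x[0] * x[1])
                  + 20 * (x[2] - 0.5) ** 2 + 10 * x[3] + 5 * x[4] + eps)
        else:  # tree model (2.22); booleans act as 0/1 indicators
            yi = (5 * (x[0] <= 1) * (x[1] <= 0.2)
                  + 4 * (x[2] <= 1) * (x[3] <= -0.2)
                  + 3 * (x[4] <= 0) + eps)
        X.append(x)
        y.append(yi)
    return X, y
```

With the default `p_noise=500` this reproduces the p = 505 design of the study, where only the first five columns carry signal.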
Graphical Display of Simulation Results
Figure 2.1: Prediction MSE for linear model when ρ = 0
Figure 2.2: Prediction MSE for linear model when ρ = 0.5
Figure 2.3: Prediction MSE for MARS model when ρ = 0
Figure 2.4: Prediction MSE for MARS model when ρ = 0.5
Figure 2.5: Prediction MSE for tree model when ρ = 0
Figure 2.6: Prediction MSE for tree model when ρ = 0.5
For each generated dataset, two types of random forests were built. For the first type, which we refer to as the 'oracle model', only the truly important variables were used. For the second type, which we call the 'naive model', all variables were used blindly. The MSE values were monitored during the training process of the random forest as the number of trees accumulated. A test sample of size n = 5,000 was also sent down the fitted RF models along the training process to obtain the prediction MSE measures. The resultant MSE and prediction MSE measures for the different models were plotted versus the number of trees in the forest in Figures 2.1–2.6. The results allow us to inspect how both RF methods perform and compare to each other as the number of trees increases.
In Figures 2.1–2.6, mse0 and mse represent the MSE values for the oracle and naive models respectively, whereas amse0 and amse represent the average MSE values over simulation runs. From the graphs, we can observe that the oracle models containing only the true predictors (p = 5) have smaller prediction mean squared errors (light blue for runs and deep blue for the average of runs by trees) compared to the naive models containing the true predictors plus the noise variables (p = 505; light red for runs and deep red for the average of runs by trees). Moreover, although the same true predictors were used in both models, in the presence of noise variables the performance of RF declined substantially, regardless of whether there is correlation between covariates.
2.8 Related Works
In recent times, many studies on the problems associated with high dimensions have been conducted. As mentioned earlier, having a small portion of important predictors mixed with massive numbers of noise variables is one factor that contributes to the problems of high dimensions. Using simple random sampling to select the subset of features at each node will likely include a large number of non-informative features and degrade the performance of the base learners (trees). A method to remedy this problem was proposed by Amaratunga et al. (2008). Instead of simple random sampling, this method introduced weighted random sampling at each node of each tree. In their approach, weights \{w_i\} are imposed on the genes (variables) so that less informative genes are less likely to be selected, which increases the odds that trees containing more informative features are included in the forest. Features were scored by testing each feature for a group mean effect using a two-sample t-test, with q-values (Storey and Tibshirani, 2003) calculated from the p-values as

    q_{(i)} = \min_{k \ge i} \left\{ \min\!\left( \frac{\Theta}{k}\, p_{(k)},\; 1 \right) \right\}

where Θ is the total number of features, p_{(k)} is the k-th smallest p-value, and q_{(i)} is the q-value of the feature with the i-th smallest p-value.
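As a sketch of the q-value computation (assuming the running-minimum, step-up reading of the formula above), one can walk the sorted p-values from largest to smallest:

```python
def q_values(pvals):
    """Sketch of q-values in the style of Storey and Tibshirani (2003):
    with Theta features and sorted p-values p_(1) <= ... <= p_(Theta),
    q_(i) = min over k >= i of min((Theta / k) * p_(k), 1)."""
    theta = len(pvals)
    order = sorted(range(theta), key=lambda i: pvals[i])
    q = [0.0] * theta
    running = 1.0
    # Walk from the largest p-value down, keeping a running minimum so
    # that each q-value is monotone in the p-value ranking.
    for rank in range(theta, 0, -1):
        i = order[rank - 1]
        running = min(running, min(theta / rank * pvals[i], 1.0))
        q[i] = running
    return q

qs = q_values([0.001, 0.02, 0.4, 0.9])
```

For the four p-values above this yields q-values 0.004, 0.04, 0.533..., 0.9, i.e., each p-value is inflated by Θ/k and then forced to be monotone and capped at 1.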
The incorporation of co-data to replace the uniform sampling probabilities has also been used in some recent work on high-dimensional random forests. Te Beest et al. (2017) introduced the Co-data moderated Random Forest (CoRF), in which co-data are incorporated into the random forest by replacing the uniform sampling probabilities used to draw candidate variables with co-data moderated sampling probabilities (obtained via empirical Bayes estimation).
Random forests have also shown low performance on very high-dimensional data with dependencies: the general random forest method does not efficiently exploit cases where many combinations among variables arise from such dependencies. Do et al. (2010) extended the initial idea of Do et al. (2009) by proposing a random oblique decision tree method that uses linear proximal SVMs to perform multivariate node splitting at the tree-growing stage, so that the trees can make use of dependencies among features. This method succeeded in producing stronger individual trees than the usual forest method, which picks only a single feature for splitting at each node. Their proposed method improves the performance of random forests significantly on very high-dimensional data and slightly on lower-dimensional data.
High-dimensional datasets often contain highly correlated features, which hinder the random forest's ability to identify the strongest features by reducing the estimated variable importance scores of correlated features (Gregorutti et al., 2017). Gregorutti et al. (2017) proposed a solution known as the Random-Forest-Recursive Feature Elimination (RF-RFE) algorithm as a variable selection criterion (Darst et al., 2018); permutation variable importance scores are employed as a ranking criterion to recursively eliminate variables in a backward elimination strategy. Wright and Ziegler (2015) developed a fast implementation package for high-dimensional data in C++ and R known as RANdom forest GEneRator (ranger), which focuses on implementation efficiency and memory usage when dealing with high dimensions. A recent R implementation for high-dimensional random forests is the Rborist package (Seligman, 2015), and a novel software package known as Random Jungle (RJ) is available in C++ (Schwarz et al., 2010). RJ implements all the features of the original randomForest implementation by Breiman and Cutler (2004), which was written in Fortran 77.
Chapter 3
High-Dimensional Random Forests
with Ridge Regression Based
Variable Screening
In this chapter, we present the proposed method for developing random forest models with high-dimensional data, or data that contain a large number of noise variables. In this project, we develop a novel feature screening method, based on ridge regression, to apply before fitting a random forest model. High-dimensional data will often present many candidate variables for splitting-variable selection, most of which are non-informative or have no predictive power. Therefore, we propose using the ridge estimators as a variable screener and applying the original random forest to a top portion of the screened features.
3.1 Ridge Estimator
The multicollinearity problems associated with the ordinary least squares estimate motivated the idea of ridge regression. Ridge regression introduces an ℓ2 penalty on the regression coefficients to penalize the least squares loss.
Consider the general linear regression model

    Y_i = \beta_0 + \sum_{j=1}^{p} X_{ij} \beta_j + \varepsilon_i    (3.1)

where X₁, ..., X_p are p predictor variables and the errors \varepsilon_i are random with mean 0 and variance 1, \varepsilon_i \sim N(0, 1). The goal of this linear model is to estimate the unknown regression parameters \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T.
Recall that the ordinary least squares fitting procedure estimates the regression parameters \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T by minimizing the residual sum of squares

    RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2    (3.2)

The least squares estimates can explode with absurdly high variance when the matrix X^T X is singular or nearly singular. To control these variations, a ridge constraint (penalty) is imposed to regularize the coefficients and control how large they grow.
The ridge regression coefficient estimates \hat{\beta}^R = (\hat{\beta}^R_0, \hat{\beta}^R_1, \ldots, \hat{\beta}^R_p)^T are the values that minimize the penalized residual sum of squares

    \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2    (3.3)

where λ ≥ 0 is the tuning parameter.
The above penalized LS form can be equivalently written as the constrained optimization problem

    \min \sum_{i=1}^{n} \left( y_i - z_i^T \beta \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t
    \quad \Longleftrightarrow \quad \min\, (y - Z\beta)^T (y - Z\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t    (3.4)

where Z is the matrix of standardized variables with mean 0 and variance 1.
Consider the penalized residual sum of squares

    PRSS(\beta) = \sum_{i=1}^{n} \left( y_i - z_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = (y - Z\beta)^T (y - Z\beta) + \lambda \lVert \beta \rVert_2^2    (3.5)

Taking the derivative and setting it to zero, we obtain

    \frac{\partial PRSS(\beta)}{\partial \beta} = -2 Z^T (y - Z\beta) + 2\lambda\beta \overset{\text{set}}{=} 0
    \;\Rightarrow\; Z^T (y - Z\beta) - \lambda\beta = 0
    \;\Rightarrow\; Z^T y - (Z^T Z + \lambda I_p)\beta = 0

where I_p is the p × p identity matrix. Hence the ridge estimator of β is of the form

    \hat{\beta}^R_\lambda = (Z^T Z + \lambda I_p)^{-1} Z^T y    (3.6)
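Equation (3.6) can be computed directly in a few lines of NumPy; the data below are synthetic and the `ridge` helper is ours, not the thesis implementation.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator (3.6):
    beta = (Z'Z + lambda * I)^{-1} Z'y, with Z standardized."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

rng = np.random.default_rng(0)
n, p = 100, 5
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)      # standardize the columns
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.0])
y = Z @ beta_true + rng.standard_normal(n)

b_small = ridge(Z, y, lam=0.01)   # close to the least squares solution
b_big = ridge(Z, y, lam=1e6)      # heavily shrunk toward zero
```

Solving the linear system with `np.linalg.solve` is preferred over forming the explicit inverse, and the two penalty settings illustrate the shrinkage behavior discussed next.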
3.2 Ridge Penalty (λ)
The solution (3.6) of the ridge estimator is indexed by the ridge penalty, also known as the tuning parameter λ; each λ value produces a unique solution. This penalty shrinks the coefficient estimates toward 0 as the penalty parameter increases.
λ is the shrinkage parameter and it does the following:
• the size of the ridge coefficients is controlled by λ
• λ also controls the amount of regularization
• as λ → 0, the least squares solution is obtained: \lim_{\lambda \to 0} \hat{\beta}^R_\lambda = (X^T X)^{-1} X^T Y
• as λ → ∞, \hat{\beta}^R_\lambda \to 0 (the model becomes an intercept-only model)
As listed above, the tuning parameter λ plays an important role in the solution of PRSS(β). In our proposed method, however, we fix the tuning parameter at λ = ε, where ε is any small number close to 0. A small λ value lowers the effect of multicollinearity while producing an estimate that is nearly as good as the least squares estimate (LSE). This works because the ranking of variables is invariant to the choice of λ, in a probabilistic sense, as long as λ lies within some reasonable range (more justification is provided in the next section). Fixing λ in advance also improves the computational efficiency of the algorithm.
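The ranking-invariance claim can be illustrated numerically: for two penalties within a reasonable range, the ordering of the leading |β̂^R_j| values typically agrees. This is a synthetic sketch, not a proof, and the helper is ours.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator: (Z'Z + lambda * I)^{-1} Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

rng = np.random.default_rng(42)
n, p = 200, 8
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
beta = np.array([4.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y = Z @ beta + rng.standard_normal(n)

# Rank the features by |beta^R| under two different small penalties;
# the ordering of the informative features should agree.
rank_a = np.argsort(-np.abs(ridge(Z, y, 0.001)))
rank_b = np.argsort(-np.abs(ridge(Z, y, 1.0)))
```

Because λ is small relative to the eigenvalues of Z'Z here, the two rankings place the same four informative features at the top in the same order, which is what licenses fixing λ = ε for screening.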
3.3 Variable Screening
Our interest is in using ridge regression to screen the predictors in order to remove some of the noise predictors that cause the low performance of random forest models in high dimensions. The variable screening procedure begins with an initial stage that ranks the predictors in order of importance. At this initial stage, a ridge regression model is fitted on the entire data. The ridge estimates (coefficients) are sorted by their absolute values, such that |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j', and a top portion (say α₁%) of the predictors is selected based on the sorted coefficients. In other words, the bottom (100 − α₁)% of the variables is truncated. After this initial stage of variable screening, random forest can be applied to the selected relevant variables. However, the algorithm is flexible in allowing multiple stages of variable screening, depending on the dimension of the data and on whether the selected portion is still dominated by noise variables.
In the second stage of the algorithm, another ridge model is fitted on the selected α₁% of the variables and the ridge estimates (coefficients) are again sorted such that |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j'. At this stage, α₂% of the predictors are selected based on the sorted coefficients; equivalently, the bottom (100 − α₂)% of the predictors is truncated. The remaining αγ% of the variables will then contain more relevant and fewer uninformative variables, yielding a better random forest model. The flexibility of the algorithm allows for multiple stages (M) of variable screening, and the parameters {α₁, α₂, ..., α_M} can be adjusted by the user.
Conjecture 1. Assuming the predictors x_j are all normalized, if |\beta_j| \ge |\beta_{j'}|, then with high probability |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for any λ > 0 that may satisfy some conditions.
Input: Data D = {(x_i, y_i) ∈ IR^p × IR}_{i=1}^{n}
Data: Training data D₁ = {(x_i, y_i) ∈ D} for ridge regression
1  begin
2      set the tuning parameter λ = ε
3      set the screening percentages α₁, ..., α_M
4      for m → 1 to M do
5          fit a ridge regression model on the current training set
6          extract the coefficients β^R_j
7          sort the coefficients by absolute value, |β^R_(1)| ≥ |β^R_(2)| ≥ ...
8          keep the top α_m% of the features and discard the rest
9          set the current training set to the retained features
10     end
11     set the final reduced data to D₁^red
12 end
Data: New training data D₁^red = {(x_i, y_i) ∈ D₁}
13 set the number of trees, ntree = T
14 fit a random forest model RF with T trees using D₁^red
Result: Random forest model RF with the αγ% important ridge regression features
Algorithm 1: Algorithm for the proposed method.
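Algorithm 1's screening loop can be sketched compactly in Python using the closed-form ridge estimator; the final random forest fit is omitted and the helper names (`ridge`, `ridge_screen`) are ours, not the thesis code.

```python
import numpy as np

def ridge(Z, y, lam):
    """Closed-form ridge estimator: (Z'Z + lambda * I)^{-1} Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

def ridge_screen(X, y, alphas, lam=1e-3):
    """Multi-stage ridge screening (Algorithm 1 sketch): at each stage,
    fit ridge on the surviving columns and keep the top alpha fraction
    by |coefficient|. Returns the indices of the retained columns."""
    keep = np.arange(X.shape[1])
    for alpha in alphas:                      # e.g. alphas = [0.30, 0.50]
        Z = X[:, keep]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
        coef = np.abs(ridge(Z, y, lam))
        k = max(1, int(round(alpha * len(keep))))
        keep = keep[np.argsort(-coef)[:k]]    # top-k by |beta^R|
    return np.sort(keep)

# Toy check: 5 informative features among 100.
rng = np.random.default_rng(7)
n, p = 300, 100
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([3.0, -2.5, 2.0, 1.5, 1.0]) + rng.standard_normal(n)

kept = ridge_screen(X, y, alphas=[0.30, 0.50])   # retains 15% of columns
```

With α₁ = 30% and α₂ = 50%, the two stages retain 30 and then 15 of the 100 columns, and the five informative features survive both cuts; a random forest would then be fitted on the retained columns only.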
Chapter 4
Simulation Study
In this chapter, we present a series of simulation studies to explore and investigate the
performance of the proposed methods. We illustrate the capability of ridge regression in
screening variables and apply random forest to a top α% of relevant predictors. The aim
is to demonstrate the effectiveness of the proposed method in handling high-dimensional
data. Datasets of high dimensions with different sizes and characteristics were generated
and used in these experiments. The results were compared to the standard random forest method via prediction mean squared errors (PMSE).
4.1 Simulation Models
High-dimensional data, with a training set D₁ of size n = 500 and a test set D₂ of size n = 5000, were generated from the linear model (2.20) and the tree model (2.22) presented in Section 2.7.1.
The first 5 predictors (x₁, x₂, ..., x₅) were considered the true variables. For each simulation model, 500 additional noise predictors \{x_i\}_{i=6}^{505} were randomly generated from the standard normal distribution N(0, 1). These predictors are considered noise because they have little or no significant contribution in predicting the response variable Y. For each of the simulation models, three random forest models were fitted: the oracle model, the naive model and the proposed model. The oracle model was fitted with only the true predictors (p = 5), the naive model was fitted with all the predictors (p = 505), and the proposed model was fitted with the screened variables. For each RF model (oracle, naive and proposed), we considered both the case of no correlation between covariates and the case of correlated covariates, that is, ρ ∈ {0, 0.5}. A total of 100 simulation iterations (runs) were performed with both the training and test sets, growing 500 trees for each run.
Two variable screening stages of the algorithm were performed for the proposed model. At the first stage, 30% of the predictors were selected based on the sorted ridge regression coefficients |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j' (i.e., α₁ = 30%). In the second stage, 50% of the initially selected 30% of the predictors were retained, again based on the sorted coefficients (i.e., α₂ = 50%). The screening therefore retains 15% of the variables of D₁ and D₂, a subset containing a much higher proportion of informative variables: (50/100) × (30/100) = 15/100 = 15%. After the second stage of variable screening, the randomForest package in R was applied to the retained 15% of relevant variables; see Liaw et al. (2002) and RColorBrewer and Liaw (2018) for random forest implementation in R.
4.2 Comparing Performance of the Proposed Model to the Existing Model
The performance of the proposed method is compared to that of the classical random forest method. In this regard, we employed the prediction mean squared errors (MSE) for the runs and the average prediction mean squared errors (AMSE) as performance measures. The MSE is computed by equation (2.15) and the AMSE by equation (2.18), both presented in Section 2.5. For each of the models (oracle, naive and proposed), the mean squared errors were computed across the number of trees (T = 500) and runs (R = 100).
Tables 4.1, 4.2, 4.3 and 4.4 present the comprehensive results on the average mean squared error (2.18) by runs R. Figures 4.1, 4.2, 4.3 and 4.4 below give a visual presentation of the mean squared errors by runs and trees (2.17) and the average mean squared errors by runs (2.18) for the linear model, and Figures 4.5, 4.6, 4.7 and 4.8 present the same for the tree model. The red, green and blue points represent the MSEs and AMSEs of the naive model, oracle model and proposed model respectively.
It can be observed from the tables and figures below that the MSE and AMSE values of the proposed model are smaller than those of the oracle and naive models. Also, the MSE stabilizes as the number of trees increases (ntree → ∞), supporting the general observation that the performance of a forest improves with a large number of trees. In terms of precision and performance, the proposed method clearly outperforms the existing random forest method on high-dimensional data. Comparing the cases ρ ∈ {0, 0.5}, the proposed method performs better regardless of the presence of multicollinearity. Moreover, judging by the mean squared errors (2.15), applying ridge regression for variable screening clearly yields a set of highly relevant/informative variables, and thus a better random forest model in the high-dimensional setting.
4.2.1 Simulation Results from Linear Model
Table 4.1: Average prediction MSE values for linear model (2.20) with ρ = 0, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     9.04    8.62         9.51    9.14        8.82    8.60
2      1     7.91    5.26         8.74    5.79        7.91    5.39
3      1     7.23    4.31         7.91    4.64        6.95    4.17
4      1     6.75    3.85         7.45    4.18        6.35    3.69
5      1     6.22    3.56         6.94    3.84        5.94    3.42
...
249    1     2.74    2.41         2.95    2.56        2.44    2.24
250    1     2.74    2.41         2.95    2.56        2.44    2.24
251    1     2.74    2.41         2.95    2.56        2.44    2.24
252    1     2.74    2.41         2.95    2.56        2.44    2.24
...
497    1     2.71    2.41         2.92    2.55        2.41    2.22
498    1     2.71    2.41         2.92    2.55        2.41    2.22
499    1     2.71    2.41         2.92    2.55        2.41    2.22
500    1     2.71    2.41         2.92    2.55        2.41    2.22
Figure 4.1: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.
Figure 4.2: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.
Table 4.2: Average prediction MSE values for linear model (2.20) with ρ = 0.5, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     6.06    5.97         7.58    7.52        6.73    6.77
2      1     5.71    4.08         6.89    4.86        5.84    4.26
3      1     5.29    3.43         6.39    3.94        5.46    3.44
4      1     4.87    3.10         5.88    3.49        5.02    3.04
5      1     4.54    2.91         5.35    3.18        4.61    2.81
...
249    1     2.13    2.15         2.14    2.17        1.84    1.89
250    1     2.13    2.15         2.14    2.17        1.84    1.89
251    1     2.13    2.15         2.14    2.17        1.84    1.89
252    1     2.13    2.15         2.14    2.17        1.84    1.89
...
497    1     2.11    2.14         2.12    2.16        1.82    1.88
498    1     2.11    2.14         2.12    2.16        1.82    1.88
499    1     2.11    2.14         2.12    2.16        1.82    1.88
500    1     2.11    2.14         2.12    2.16        1.82    1.88
Figure 4.3: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.5.
Figure 4.4: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.5.
4.2.2 Simulation Results from Tree Model
Table 4.3: Average prediction MSE values for tree model (2.22) with ρ = 0, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     7.65    7.83         10.10   10.00       8.91    8.93
2      1     7.16    4.96         9.61    6.43        7.93    5.45
3      1     6.58    4.02         9.06    5.28        7.62    4.50
4      1     6.07    3.55         8.34    4.59        7.02    3.95
5      1     5.54    3.23         7.59    4.13        6.45    3.57
...
249    1     2.33    2.11         3.05    2.57        2.50    2.21
250    1     2.33    2.11         3.04    2.56        2.50    2.21
251    1     2.33    2.11         3.05    2.56        2.50    2.21
252    1     2.33    2.11         3.05    2.57        2.50    2.21
...
497    1     2.30    2.09         3.01    2.55        2.46    2.20
498    1     2.30    2.09         3.01    2.55        2.46    2.20
499    1     2.30    2.09         3.00    2.55        2.46    2.20
500    1     2.30    2.09         3.00    2.55        2.46    2.20
Figure 4.5: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.
Figure 4.6: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.
Table 4.4: Average prediction MSE values for tree model (2.22) with ρ = 0.5, runs = 100.

                   Oracle               Naive                Proposed
Trees  Run   amse0   amse0.test   amse    amse.test   amse1   amse1.test
1      1     6.97    6.88         8.48    8.39        7.28    7.26
2      1     6.31    4.46         7.58    5.20        6.60    4.65
3      1     5.72    3.57         7.01    4.15        6.02    3.70
4      1     5.28    3.19         6.55    3.69        5.56    3.27
5      1     4.92    2.96         6.13    3.40        5.14    2.99
...
249    1     2.04    2.00         2.26    2.13        1.99    1.94
250    1     2.04    2.00         2.26    2.13        1.99    1.94
251    1     2.04    2.00         2.26    2.13        1.99    1.94
...
497    1     2.01    1.99         2.23    2.11        1.96    1.93
498    1     2.01    1.99         2.23    2.11        1.95    1.93
499    1     2.01    1.99         2.23    2.11        1.95    1.93
500    1     2.01    1.99         2.23    2.11        1.95    1.93
Figure 4.7: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the training dataset with ρ = 0.5.
Figure 4.8: Prediction MSE and AMSE for the simulation runs of the three models: Naive (mse, amse), Oracle (mse0, amse0), and Proposed (mse1, amse1) for the test dataset with ρ = 0.5.
Chapter 5
Real Data Exploration
This chapter presents the analysis of a well-known dataset sourced from the UCI Machine Learning Repository to illustrate the proposed method's handling of high-dimensional data. Although the original data are not of high dimension, a large number of noise (redundant) variables simulated from the standard normal distribution were added to the dataset. The results of the experiment indicate that the proposed method significantly outperforms Breiman's (2001) original random forest algorithm.
5.1 Communities and Crime Dataset
The Communities and Crime dataset is a multivariate dataset that combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. It covers communities within the United States. The dataset contains many variables related to the target variable; the attributes were picked based on their plausible connections to crime. It consists of 122 predictors plus the attribute to be predicted (per capita violent crimes). The features in the crime dataset describe the community, such as the percentage of the population considered urban and the median family income, as well as law enforcement, such as the per capita number of police officers and the percentage of officers assigned to drug units. The target variable was calculated using the population of the communities and the offenses considered violent crimes in the United States, such as murder, rape, assault and robbery. The data were normalized into the decimal range 0.00–1.00 using an unsupervised equal-interval binning method. Additional variables \{X_i\}_{i=1}^{700} were randomly generated from the standard normal distribution, \{X_i\}_{i=1}^{700} \sim N(0, 1). These variables are unrelated and uninformative with respect to the target variable, and hence are considered noise variables. The final data consisted of 822 variables, including 1 response variable, with 1994 observations (cases). The dataset was thoroughly cleaned for analysis purposes and partitioned into a training set and a test set in the ratio 2:1. Variables with less than 50% missing values were imputed using the Multivariate Imputations by Chained Equations (MICE) R package, whereas variables with more than 50% missing values were removed.
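The 50% missingness rule can be sketched as follows; plain mean imputation stands in here for the MICE procedure used in the actual analysis, and the `clean_columns` helper is illustrative only.

```python
import statistics

def clean_columns(columns, max_missing=0.5):
    """Sketch of the preprocessing rule: drop any variable with more
    than `max_missing` missing values (None); otherwise impute the
    gaps. Mean imputation stands in for the MICE procedure."""
    cleaned = {}
    for name, values in columns.items():
        n_missing = sum(v is None for v in values)
        if n_missing / len(values) > max_missing:
            continue                          # too sparse: remove variable
        observed = [v for v in values if v is not None]
        fill = statistics.mean(observed)      # MICE would model this instead
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

cols = {
    "x1": [1.0, None, 3.0, 4.0],              # 25% missing -> imputed
    "x2": [None, None, None, 8.0],            # 75% missing -> dropped
}
out = clean_columns(cols)
```

Here `x2` is discarded while the single gap in `x1` is filled with the mean of its observed values.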
The analysis was carried out with our proposed method, and its performance was compared to that of the standard random forest method. Performance measures such as the prediction mean squared error (MSE, equation 2.15) and the percentage of variation explained (equation 2.16) were used. The ridge estimator was applied in variable screening, with α₁ = 30% and α₂ = 50%, to obtain the informative predictors. In each stage of variable screening, predictors were selected based on the sorted ridge estimates, |\hat{\beta}^R_j| \ge |\hat{\beta}^R_{j'}| for j \ne j'. The selected 15% of the total variables hold most of the predictive power, and the randomForest package in R was then applied to these 15% of relevant variables. The performance of the proposed and existing methods was assessed via the mean squared errors and the percentage of variation explained by the models. In summary, the naive random forest method was fitted with all the predictors (p = 821), whereas the proposed random forest method was fitted with the 15% of relevant variables, that is, p = 123 variables.
5.1.1 Presentation of Results from Real Data Application
Table 5.1: Results of analysis of the Communities and Crime dataset, comparing the proposed method to the existing (naive) random forest method in dealing with high dimensions.

Models            Dataset       No. of Trees   MSE       % Var Explained
Naive Method      Training Set  500            0.02052   61.85
                  Test Set      500            0.0500    12.24
Proposed Method   Training Set  500            0.01952   63.72
                  Test Set      500            0.0200    65.24
Table 5.2: Prediction MSE values of the training and test sets for the proposed method and the existing (naive) random forest method.

              Naive Method             Proposed Method
Trees   Training MSE   Test MSE   Training MSE   Test MSE
1       0.0448         0.0663     0.0369         0.0428
2       0.0420         0.0525     0.0353         0.0304
3       0.0438         0.0484     0.0340         0.0275
4       0.0424         0.0541     0.0333         0.0254
5       0.0397         0.0550     0.0329         0.0225
...
249     0.0207         0.0476     0.0197         0.0193
250     0.0207         0.0475     0.0197         0.0193
251     0.0207         0.0474     0.0197         0.0193
252     0.0207         0.0474     0.0197         0.0193
...
497     0.0205         0.0485     0.0195         0.0192
498     0.0205         0.0484     0.0195         0.0192
499     0.0205         0.0484     0.0195         0.0192
500     0.0205         0.0484     0.0195         0.0192
Figure 5.1: Out-of-bag error estimates of both models
Figure 5.2: Comparing Prediction MSE for Training and Test Sets
Tables 5.1 and 5.2 present the performance measures (mean squared errors and per-
centage of variation explained) of the existing RF model (mse1) and the proposed method
(mse2). These results are visualized in Figures 5.1 and 5.2. It is clear from the results
(Figures 5.1 and 5.2) that the out-of-bag error estimates and MSEs of the proposed
method (mse2) are lower than those of the existing method (naive, mse1) of Breiman
(2001). The percentage of variation explained after variable screening via the ridge
estimator also increases significantly compared to the existing method.
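As a side note, the "% Var Explained" that randomForest reports is a pseudo-R2 computed as 1 - MSE/Var(y), so the MSE and variance-explained columns of Table 5.1 are two views of the same quantity. A minimal base-R sketch (the numbers below are hypothetical illustrations, not taken from the analysis):

```r
# Pseudo R-squared as reported by randomForest: 1 - MSE / Var(y)
# (var_y and mse below are hypothetical illustrative values)
var_y <- 0.0538                 # variance of the response
mse   <- 0.0205                 # out-of-bag mean squared error
pct_var_explained <- 100 * (1 - mse / var_y)
round(pct_var_explained, 2)     # about 61.9
```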
Chapter 6
Discussion and Conclusion
6.1 Summary of Findings
In this study, we have presented a variable screening method for informative feature se-
lection when building random forest models with high-dimensional data. This new random
forest algorithm integrates ridge regression as a variable screening tool to obtain informative
predictors in high-dimensional settings. The algorithm allows for multiple stages (M) of
variable screening by specifying the percentage (αM) of variables to be retained at each
stage. It can be extended to multi-class classification problems and to data containing
noisy variables, retaining a higher proportion of informative variables for improved
RF modeling.
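The multi-stage screening step described above can be sketched in base R. This is an illustrative sketch, not the thesis implementation: it uses the closed-form ridge solution on standardized simulated predictors, a fixed lambda, and a constant keep-fraction alpha at every one of the M stages; the helper names ridge_coef and screen are introduced here purely for illustration.

```r
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # two informative predictors

ridge_coef <- function(X, y, lambda) {
  # closed-form ridge coefficients on standardized predictors
  Xs <- scale(X)
  drop(solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, y)))
}

screen <- function(X, y, M = 2, alpha = 0.5, lambda = 0.01) {
  keep <- colnames(X)
  for (m in 1:M) {
    b <- ridge_coef(X[, keep, drop = FALSE], y, lambda)
    n.keep <- max(1, trunc(length(keep) * alpha))
    # rank surviving variables by absolute ridge coefficient
    keep <- keep[order(abs(b), decreasing = TRUE)][1:n.keep]
  }
  keep
}

selected <- screen(X, y, M = 2, alpha = 0.5)
length(selected)   # 50 -> 25 -> 12 variables after two stages
```

In the thesis implementation the ridge fits are obtained with MASS::lm.ridge (see the appendix code) rather than this closed-form helper.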
A preliminary simulation study was carried out to illustrate and assess the problems associ-
ated with high-dimensional random forests. The study demonstrated that the presence
of massive numbers of noisy variables significantly degrades the performance and precision
of random forests. This result supports the work of Xu et al. (2012), Amaratunga et al.
(2008), and Darst et al. (2018) on how data dominated by a large number of uninformative
variables and correlated predictors can significantly harm model performance. The
empirical results revealed how the small expected number of informative variables among
the mtry candidates can make it difficult to obtain an accurate or robust random forest
model, which supports the findings of Amaratunga et al. (2008).
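To make the mtry point concrete: when mtry candidates are drawn uniformly from p variables of which p.info are informative, the expected number of informative candidates at any split is the hypergeometric mean mtry × p.info / p. With counts matching the simulation design (5 informative variables plus 500 noise variables):

```r
# Expected informative candidates per split (hypergeometric mean)
p <- 505                 # 5 informative + 500 noise predictors
p.info <- 5
mtry <- floor(p / 3)     # default regression mtry in randomForest
mtry * p.info / p        # about 1.66 informative candidates per split
```

With fewer than two informative variables expected among 168 candidates, most splits are chosen from noise, which is exactly the difficulty the screening step is meant to relieve.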
An extension of the simulation studies was then employed to test how our proposed algo-
rithm addresses this issue by first screening variables to retain the relevant ones, without
compromising, but rather improving, the performance of the random forest models. To better
understand how our method works, we also included the naive and oracle methods, which
allowed us to make meaningful comparisons via performance measures (e.g., MSE).
Based on these performance measures (mean squared errors), the proposed
method clearly outperforms the naive random forest approach.
In an attempt to further validate the results, we applied our proposed method to a real-
life dataset (the Communities and Crime dataset), sourced from the UCI repository.
From the results presented in Tables 5.1 and 5.2, there was a significant improvement in the
mean squared errors. In particular, the percentage of variation explained by the proposed
method compared to the naive model was striking: R2 increased
from 61.85% to 63.72% on the training set and from 12.24% to 65.24% on the test set.
In conclusion, our results have shown how variable screening using ridge regression can
be a very useful tool for handling high-dimensional random forests. The relevance of this
work serves as a foundation for further research on variable screening at the tree and split
levels of random forest implementation.
6.2 Future Work
One limitation of our proposed method arises when there is a nonlinear association between
the predictors and the response or target variable: its performance is compromised in the
presence of nonlinear and interaction effects among variables. We considered one ad hoc
remedy, which is to include spline basis terms for each variable with the same number of
degrees of freedom (v), as in fitting a Generalized Additive Model (GAM), although the
screening model remains linear in the basis terms. Here the degrees of freedom v can be
treated as a tuning parameter. Due to time constraints, this limitation will be addressed in future work.
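The ad hoc spline expansion mentioned above can be sketched as follows, using the natural-spline basis ns() from the splines package that ships with R. This is only a sketch of the idea (the expanded design matrix would feed into the ridge screening step), not something implemented in the thesis; the df value v plays the role of the tuning parameter:

```r
library(splines)

set.seed(1)
x <- matrix(rnorm(200 * 3), 200, 3)   # three hypothetical predictors
v <- 4                                # degrees of freedom per variable
# expand each column into a natural-spline basis with v columns
X.spline <- do.call(cbind, lapply(1:ncol(x), function(j) ns(x[, j], df = v)))
dim(X.spline)   # 200 rows and 3 * v = 12 basis columns
```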
Moreover, in this research we considered variable screening at the forest level, which is
a preliminary step toward the future studies outlined below. Future work will seek to
implement tree-level and split-level variable screening in HD random forests. We plan to
continue developing this research in the following directions:
1. (Tree-Level): Repeat the ridge-estimator screening with each (or a few) bootstrap
samples in the random forest and combine the results via ensemble learning (e.g.,
using the rpart package in R).

2. (Split-Level): Apply variable screening/selection/weighting at each split. The
selection can be made over variables or splits, and for nonlinear cases we can include
spline terms.
In conclusion, these further studies would go a long way toward implementing a more robust
random forest model that can better handle high-dimensional data.
References
Amaratunga, D., Cabrera, J., and Lee, Y.-S. (2008). Enriched random forests. Bioinfor-
matics, 24(18):2010–2014.
Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized
trees. Neural computation, 9(7):1545–1588.
Berdugo, E., Bustos, D., Gonzalez, J., Palacio, A., Perez, J., and Rocha, E. (2020). Iden-
tification and comparison of factors associated with performance in the saber pro and
saber tyt exams for the period 2016-2019.
Bernard, S., Heutte, L., and Adam, S. (2010). A study of strength and correlation in
random forests. In International Conference on Intelligent Computing, pages 186–191.
Springer.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning
Research, 13(1):1063–1095.
Breiman, L. (1996a). Bagging predictors. Machine learning, 24(2):123–140.
Breiman, L. (1996b). Out-of-bag estimation.
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.
1. Statistics Department University of California Berkeley, CA, USA, 1:58.
Capitaine, L., Genuer, R., and Thiebaut, R. (2020). Random forests for high-dimensional
longitudinal data. Statistical Methods in Medical Research, page 0962280220946080.
Darst, B. F., Malecki, K. C., and Engelman, C. D. (2018). Using recursive feature elim-
ination in random forest to account for correlated variables in high dimensional data.
BMC genetics, 19(1):1–6.
Denil, M., Matheson, D., and De Freitas, N. (2014). Narrowing the gap: Random forests in
theory and in practice. In International conference on machine learning, pages 665–673.
PMLR.
Dietterich, T. G. (1998). An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting and randomization. Machine learning,
32:1–22.
Dietterich, T. G. (2000). An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting, and randomization. Machine learning,
40(2):139–157.
Do, T.-N., Lallich, S., Pham, N.-K., and Lenca, P. (2009). Un nouvel algorithme de forêts
aléatoires d'arbres obliques particulièrement adapté à la classification de données en
grandes dimensions. In EGC, pages 79–90.
Do, T.-N., Lenca, P., Lallich, S., and Pham, N.-K. (2010). Classifying very-high-
dimensional data with random forests of oblique decision trees. In Advances in knowledge
discovery and management, pages 39–55. Springer.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics,
pages 1–67.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2017). Correlation and variable importance
in random forests. Statistics and Computing, 27(3):659–678.
Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd international conference
on document analysis and recognition, volume 1, pages 278–282. IEEE.
Kulkarni, V. Y. and Sinha, P. K. (2012). Pruning of random forest classifiers: A survey
and future directions. In 2012 International Conference on Data Science & Engineering
(ICDSE), pages 64–68. IEEE.
Liaw, A., Wiener, M., et al. (2002). Classification and regression by randomforest. R news,
2(3):18–22.
Louppe, G. (2014). Understanding random forests: From theory to practice. arXiv preprint
arXiv:1407.7502.
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable
importances in forests of randomized trees. Advances in neural information processing
systems 26.
RColorBrewer, S. and Liaw, M. A. (2018). Package ‘randomforest’. University of Califor-
nia, Berkeley: Berkeley, CA, USA.
Schwarz, D. F., König, I. R., and Ziegler, A. (2010). On safari to random jungle: a fast
implementation of random forests for high-dimensional data. Bioinformatics, 26(14):1752–
1758.
Seligman, M. (2015). Rborist: extensible, parallelizable implementation of the random
forest algorithm. R Package Version, 1(3):1–15.
Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences, 100(16):9440–9445.
Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest
variable importance measures: Illustrations, sources and a solution. BMC bioinformatics,
8(1):1–21.
Te Beest, D. E., Mes, S. W., Wilting, S. M., Brakenhoff, R. H., and van de Wiel, M. A.
(2017). Improved high-dimensional prediction with random forests by the use of co-data.
BMC bioinformatics, 18(1):1–11.
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules. Citeseer.
Wolpert, D. H. and Macready, W. G. (1999). An efficient method to estimate bagging’s
generalization error. Machine Learning, 35(1):41–55.
Wright, M. N. and Ziegler, A. (2015). ranger: A fast implementation of random forests for
high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409.
Xu, B., Huang, J. Z., Williams, G., Wang, Q., and Ye, Y. (2012). Classifying very high-
dimensional data with random forests built from small subspaces. International Journal
of Data Warehousing and Mining (IJDWM), 8(2):44–63.
Zhang, J. and Zulkernine, M. (2006). A hybrid network intrusion detection technique
using random forests. In First International Conference on Availability, Reliability and
Security (ARES’06), pages 8–pp. IEEE.
Appendix
R Codes
1. Function for Simulating High-Dimensional Data
# ###########################################
# FUNCTIONS FOR HD RF
# ###########################################

require(MASS)   # TO GENERATE MULTIVARIATE NORMAL DATA

rdat <- function(n=100, p0=10, beta0, sigma=1, rho=0.5, s2=1,
                 model=c("linear", "MARS1", "MARS2", "Tree"))
{
  p <- length(beta0) - 1   # p = NUMBER OF (INFORMATIVE) PREDICTORS
  mu <- rep(0, p)
  Sigma <- s2*(matrix(rho, nrow=p, ncol=p) + diag(rep(1-rho, p)))
  X <- mvrnorm(n=n, mu=mu, Sigma=Sigma)

  if (NCOL(X) < 5) stop("Please give me at least 5 predictors first.")
  eps <- rnorm(n, mean=0, sd=sigma)
  if (model == "linear") {
    ymean <- cbind(1, X) %*% beta0
  }
  else if (model == "MARS1") {
    ymean <- 0.1*exp(4*X[,1]) + 4/(1 + exp(-20*(X[,2] - 0.5))) +
      3*X[,3] + 2*X[,4] + X[,5]
  }
  else if (model == "MARS2") {
    ymean <- 10*sin(pi*X[,1]*X[,2]) + 20*((X[,3] - 0.5)^2) +
      10*X[,4] + 5*X[,5]
  }
  else {
    ymean <- 5*I(X[,1] <= 1)*I(X[,2] <= 0.2) +
      4*I(X[,3] <= 1)*I(X[,4] <= -0.2) + 3*I(X[,5] <= 0)
  }

  X0 <- matrix(rnorm(n*p0), nrow=n, ncol=p0)   # NOISE PREDICTORS
  y <- ymean + eps
  dat <- data.frame(cbind(y, X, X0))
  names(dat) <- c("y", paste("x", 1:(p+p0), sep=""))
  return(dat)
}
2. Illustration of HD Problems
####################################################################
# SIMULATION STUDIES DESIGNED TO ILLUSTRATE PROBLEMS IN RF WITH HD
####################################################################

# =================
# SIMULATE DATA
# =================
## Libraries Required
library(MASS)
library(tidyverse)
library(tibble)
library(gridExtra)
library(randomForest)
library(dplyr)
library(ggplot2)
source("Functions-HDRF.R")

set.seed(125)
beta0 <- c(0.5, 0.1, -0.4, 0.5, 1, 3)
n <- 500; n.test <- 5000
sigma <- 1; rho <- 0; s2 <- 1
p0 <- 500   # NUMBER OF NOISE PREDICTORS
mod <- "linear"; xcols.true <- 1:5
nrun <- 20
dat.test <- rdat(n=n.test, p0=p0, beta0=beta0, sigma=sigma, rho=rho,
                 s2=s2, model=mod)
y.test <- dat.test$y
X.test <- dat.test[, -1]
ntree <- 500   # mtry <- 2
MSE <- NULL
for (i in 1:nrun){
  print(cbind(p0=p0, rho=rho, run=i))
  dat <- rdat(n=n, p0=p0, beta0=beta0, sigma=sigma,
              rho=rho, s2=s2, model=mod)

  ### TRUE VARIABLES (Oracle Method)
  fit0.RF <- randomForest(y ~ x1 + x2 + x3 + x4 + x5,
                          data=dat, xtest=X.test[, xcols.true],
                          ytest=y.test, ntree=ntree, importance=FALSE)

  ### ALL VARIABLES (Naive Method)
  fit.RF <- randomForest(y ~ ., data=dat, xtest=X.test,
                         ytest=y.test, ntree=ntree, importance=FALSE)

  ### Extracting MSEs
  Mse <- cbind(run=i, n.trees=1:ntree, mse0=fit0.RF$mse,
               mse0.test=fit0.RF$test$mse,
               mse=fit.RF$mse, mse.test=fit.RF$test$mse)
  MSE <- rbind(MSE, Mse)
}

### Renaming Columns of the MSE Table
MSE <- as_tibble(MSE)
names(MSE) <- c("run", "n.trees", "mse0", "mse0.test",
                "mse", "mse.test")
MSE

# ==========================
# EXPLORE THE RESULTS
# ==========================
MSE %>%
  group_by(n.trees) %>%
  summarise(run=1, amse0=mean(mse0), amse0.test=mean(mse0.test),
            amse=mean(mse), amse.test=mean(mse.test)) -> aMSE
aMSE

### MSE PLOTS WITH LEGENDS
colors.le <- c("mse0"="skyblue", "mse"="coral", "amse0"="blue",
               "amse"="red")
colors.lee <- c("mse0.test"="skyblue", "mse.test"="coral",
                "amse0.test"="blue", "amse.test"="red")

# TRAINING DATA
fig.1 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse0, color="mse0"), size=0.2) +
  geom_line(aes(x=n.trees, y=mse, color="mse"), size=0.2) +
  geom_line(data=aMSE, aes(y=amse0, color="amse0"), size=1) +
  geom_line(data=aMSE, aes(y=amse, color="amse"), size=1) +
  xlab("number of trees") + ylab("MSE") +
  labs(title="(a) Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.6, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# TEST DATA
fig.2 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse0.test, color="mse0.test"), size=0.2) +
  geom_line(aes(x=n.trees, y=mse.test, color="mse.test"), size=0.2) +
  geom_line(data=aMSE, aes(y=amse0.test, color="amse0.test"), size=1) +
  geom_line(data=aMSE, aes(y=amse.test, color="amse.test"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) MSE with Test Data") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.6, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")

grid.arrange(fig.1, fig.2, ncol=2, nrow=1)
3. Simulation Studies on the Proposed Method
##########################################################
## SIMULATION STUDIES DESIGNED FOR THE PROPOSED METHOD
##########################################################
source("Functions-HDRF.R")

set.seed(125)
beta0 <- c(0.5, 0.1, -0.4, 0.5, 1, 3)
n <- 500; n.test <- 5000
sigma <- 1; rho <- 0.5; s2 <- 1
p0 <- 500   # NUMBER OF NOISE PREDICTORS
mod <- "linear"; xcols.true <- 1:5
nrun <- 100
dat <- rdat(n=n, p0=p0, beta0=beta0, sigma=sigma,
            rho=rho, s2=s2, model=mod)
dat.test <- rdat(n=n.test, p0=p0, beta0=beta0,
                 sigma=sigma, rho=rho,
                 s2=s2, model=mod)
y.test <- dat.test$y
X.test <- dat.test[, -1]
ntree <- 500   # mtry <- 2
lambda <- 0.01   # Lambda value for ridge regression
MSE <- NULL

for (i in 1:nrun){
  print(cbind(p0=p0, rho=rho, run=i))

  ### TRUE VARIABLES (Oracle Method)
  fit0.RF <- randomForest(y ~ x1 + x2 + x3 + x4 + x5,
                          data=dat, xtest=X.test[, xcols.true],
                          ytest=y.test, ntree=ntree, importance=FALSE)

  ### ALL VARIABLES (Naive Method)
  fit.RF <- randomForest(y ~ ., data=dat, xtest=X.test,
                         ytest=y.test, ntree=ntree, importance=FALSE)

  ### Variable Screening With Ridge Regression
  # Stage 1: keep the top 30% of variables ranked by |ridge coefficient|
  fit.ridge1 <- lm.ridge(y ~ ., data=dat, lambda=lambda)
  beta.RR1 <- fit.ridge1$coef
  a1 <- 0.30
  pa1 <- length(beta.RR1)
  n.select1 <- trunc(pa1*a1)
  Xs.selected1 <- names(sort(abs(beta.RR1), decreasing=TRUE)[1:n.select1])
  newdat1 <- cbind(y=dat$y, dat[, Xs.selected1])
  newX.test1 <- X.test[Xs.selected1]

  # Stage 2: keep the top 50% of the remaining variables
  # (so 15% of the original variables survive overall)
  lambda2 <- 0.01
  fit.ridge2 <- lm.ridge(y ~ ., data=newdat1, lambda=lambda2)
  beta.RR2 <- fit.ridge2$coef
  a2 <- 0.50
  pa2 <- length(beta.RR2)
  n.select2 <- trunc(pa2*a2)
  Xs.selected2 <- names(sort(abs(beta.RR2), decreasing=TRUE)[1:n.select2])
  newdat2 <- cbind(y=dat$y, newdat1[, Xs.selected2])
  newX.test2 <- newX.test1[Xs.selected2]

  ### PROPOSED METHOD
  fitp.RF <- randomForest(y ~ ., data=newdat2, xtest=newX.test2,
                          ytest=y.test, ntree=ntree, importance=TRUE)

  ## Mean Squared Errors
  # mse0 = oracle, mse1 = naive, mse2 = proposed (matching the plots below)
  Mse <- cbind(run=i, n.trees=1:ntree, mse0=fit0.RF$mse,
               mse0.test=fit0.RF$test$mse, mse1=fit.RF$mse,
               mse1.test=fit.RF$test$mse, mse2=fitp.RF$mse,
               mse2.test=fitp.RF$test$mse)
  MSE <- rbind(MSE, Mse)
}

MSE <- as_tibble(MSE)
names(MSE) <- c("run", "n.trees", "mse0", "mse0.test", "mse1",
                "mse1.test", "mse2", "mse2.test")
MSE

# ====================
# EXPLORE THE RESULTS
# ====================
MSE %>%
  group_by(n.trees) %>%
  summarise(run=1, amse1=mean(mse1), amse1.test=mean(mse1.test),
            amse2=mean(mse2), amse2.test=mean(mse2.test)) -> aMSE
aMSE

##################################################
######### PLOTS INCLUDING PROPOSED METHOD ########
##################################################

colors.le <- c("mse1" = "green4", "mse2" = "blue")
colors.lee <- c("amse1" = "green4", "amse2" = "blue")

# TRAINING MSE
fig1 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse1, color="mse1"), size=0.2) +
  geom_line(aes(y=mse2, color="mse2"), size=0.2) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# AVERAGE TRAINING MSE
fig11 <- ggplot(data=aMSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=amse1, color="amse1"), size=1) +
  geom_line(aes(y=amse2, color="amse2"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) Average Training MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")

grid.arrange(fig1, fig11, ncol=2, nrow=1)

###########################################################

colors.le.test <- c("mse1.test" = "green4", "mse2.test" = "blue")
colors.lee.test <- c("amse1.test" = "green4", "amse2.test" = "blue")

## TEST MSE
fig2 <- ggplot(data=MSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=mse1.test, color="mse1.test"), size=0.2) +
  geom_line(aes(y=mse2.test, color="mse2.test"), size=0.2) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) Test MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.le.test) +
  labs(colour="Models")

## AVERAGE TEST MSE
fig22 <- ggplot(data=aMSE, aes(x=n.trees, by=factor(run))) +
  geom_line(aes(y=amse1.test, color="amse1.test"), size=1) +
  geom_line(aes(y=amse2.test, color="amse2.test"), size=1) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) Average Test MSE") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.75, 0.60),
        legend.key.size = unit(0.8, "cm"),
        legend.text = element_text(colour="black", size=8)) +
  scale_color_manual(values = colors.lee.test) +
  labs(colour="Models")

grid.arrange(fig2, fig22, ncol=2, nrow=1)
4. Real Data Exploration
####################################################################
################ REAL DATA EXPLORATION #############################
####################################################################

## Libraries Required
library(MASS)           # lm.ridge
library(dplyr)          # select
library(mice)           # multiple imputation
library(caTools)        # sample.split
library(randomForest)
library(ggplot2)
library(gridExtra)

# ====================================
# Communities and Crime Data Set
# ====================================
dat1 <- read.csv(file="communities.csv", header=TRUE, na="?")
head(dat1)
dim(dat1)
colSums(is.na(dat1))

## Removing the 5 non-predictive variables (state, county, community,
## communityname, and fold) together with PolicPerPop
dat1 <- select(dat1, -c(state, county, community,
                        communityname, fold, PolicPerPop))
dim(dat1)

## CHECKING FOR MISSING VALUES
miss.info <- function(dat, filename=NULL){
  vnames <- colnames(dat)
  n <- nrow(dat)
  out <- NULL
  for (j in 1:ncol(dat)){
    vname <- colnames(dat)[j]
    x <- as.vector(dat[, j])
    n1 <- sum(is.na(x), na.rm=T)
    n2 <- sum(x=="NA", na.rm=T)
    n3 <- sum(x=="", na.rm=T)
    nmiss <- n1 + n2 + n3
    ncomplete <- n - nmiss
    out <- rbind(out, c(col.number=j, vname=vname,
                        mode=mode(x), n.levels=length(unique(x)),
                        ncomplete=ncomplete, miss.perc=nmiss/n))
  }
  out <- as.data.frame(out)
  row.names(out) <- NULL
  if (!is.null(filename)) write.csv(out, file=filename, row.names=F)
  return(out)
}

miss.info(dat = dat1)

## Removing predictors with more than 50% missing values
dat <- dat1[, -c(26, 97:112, 116:119, 121)]
dim(dat)
miss.info(dat=dat)

## Data imputation for predictors with less than 50% missing values
dat_impute <- mice(dat, m=5, method="pmm", maxit=30, seed=100)
dat_imputed1 <- complete(dat_impute, 3)   # Selecting the third set of imputed values
dat1 <- as.data.frame(dat_imputed1)
colSums(is.na(dat1))
miss.info(dat = dat1)

# Generating Multivariate Normal Data (Noise Variables)
n <- nrow(dat1)
p0 <- 700
set.seed(100)
X0 <- round(matrix(rnorm(n*p0), nrow=n, ncol=p0), 2)   # Noise
dat <- data.frame(cbind(dat1, X0))   # append the noise to the imputed data
head(dat)
dim(dat)

## Data partitioning
set.seed(100)
sample <- sample.split(dat$ViolentCrimesPerPop, SplitRatio=(2/3))
training <- subset(dat, sample == TRUE)
test <- subset(dat, sample == FALSE)
dim(training)
dim(test)

#############################################
############# MODEL FITTING #################
#############################################
y.test <- test$ViolentCrimesPerPop
X.test <- test[, names(test) != "ViolentCrimesPerPop"]   # drop the response by name
ntree <- 500
lambda <- 0.1   # Lambda value for ridge regression

## Ridge regression
# Stage 1: selecting the best 30% of variables by |ridge coefficient|
fit.ridge1 <- lm.ridge(ViolentCrimesPerPop ~ ., data=training, lambda=lambda)
beta.RR1 <- fit.ridge1$coef
a1 <- 0.30
pa1 <- length(beta.RR1)
n.select1 <- trunc(pa1*a1)
Xs.selected1 <- names(sort(abs(beta.RR1), decreasing=TRUE)[1:n.select1])
newdat1 <- cbind(ViolentCrimesPerPop=training$ViolentCrimesPerPop,
                 training[, Xs.selected1])
newX.test1 <- X.test[Xs.selected1]

# Stage 2: selecting the best 50% of the remaining variables
# (15% of the original predictors survive overall)
fit.ridge2 <- lm.ridge(ViolentCrimesPerPop ~ ., data=newdat1, lambda=lambda)
beta.RR2 <- fit.ridge2$coef
a2 <- 0.50
pa2 <- length(beta.RR2)
n.select2 <- trunc(pa2*a2)
Xs.selected2 <- names(sort(abs(beta.RR2), decreasing=TRUE)[1:n.select2])
newdat2 <- cbind(ViolentCrimesPerPop=training$ViolentCrimesPerPop,
                 newdat1[, Xs.selected2])
newX.test2 <- newX.test1[Xs.selected2]
dim(newdat2)

#### FITTING THE PROPOSED RF MODEL
mtry1 <- tuneRF(newdat2[, -1], newdat2$ViolentCrimesPerPop,
                ntreeTry=100, stepFactor=2, improve=0.05, trace=TRUE,
                plot=FALSE, doBest=FALSE)
best.mtry1 <- mtry1[mtry1[, 2] == min(mtry1[, 2]), 1]
best.mtry1

fit.RF2 <- randomForest(ViolentCrimesPerPop ~ ., data=newdat2,
                        xtest=newX.test2, ytest=y.test, mtry=best.mtry1,
                        ntree=ntree, importance=TRUE)
print(fit.RF2)
varImpPlot(fit.RF2)

#### FITTING THE NAIVE RF MODEL
mtry <- tuneRF(training[, names(training) != "ViolentCrimesPerPop"],
               training$ViolentCrimesPerPop,
               ntreeTry=100, stepFactor=2, improve=0.05, trace=TRUE,
               plot=FALSE, doBest=FALSE)
best.mtry <- mtry[mtry[, 2] == min(mtry[, 2]), 1]

fit.RF1 <- randomForest(ViolentCrimesPerPop ~ ., data=training,
                        xtest=X.test, ytest=y.test, mtry=best.mtry,
                        ntree=ntree, importance=TRUE)
print(fit.RF1)
varImpPlot(fit.RF1)

#=======================================
#============== MSE PLOTS ==============
#=======================================
## Collecting the per-tree MSE paths (fit.RF1 = naive, fit.RF2 = proposed)
MSE <- data.frame(n.trees=1:ntree,
                  mse1.training=fit.RF1$mse, mse1.test=fit.RF1$test$mse,
                  mse2.training=fit.RF2$mse, mse2.test=fit.RF2$test$mse)

colors.le <- c("mse1.training" = "green4", "mse2.training" = "blue")
colors.lee <- c("mse1.test" = "green4", "mse2.test" = "blue")

# TRAINING MSE
fig1 <- ggplot(data=MSE, aes(x=n.trees)) +
  geom_line(aes(y=mse1.training, color="mse1.training"), size=0.8) +
  geom_line(aes(y=mse2.training, color="mse2.training"), size=0.8) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(a) MSE For Training Set") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.70, 0.60),
        legend.key.size = unit(1.0, "cm"),
        legend.text = element_text(colour="black", size=10)) +
  scale_color_manual(values = colors.le) +
  labs(colour="Models")

# TEST MSE
fig2 <- ggplot(data=MSE, aes(x=n.trees)) +
  geom_line(aes(y=mse1.test, color="mse1.test"), size=0.8) +
  geom_line(aes(y=mse2.test, color="mse2.test"), size=0.8) +
  xlab("number of trees") + ylab("pMSE") +
  labs(title="(b) MSE For Test Set") +
  theme(plot.title = element_text(hjust=0.5),
        legend.position = c(0.70, 0.80),
        legend.key.size = unit(1.0, "cm"),
        legend.text = element_text(colour="black", size=10)) +
  scale_color_manual(values = colors.lee) +
  labs(colour="Models")
grid.arrange(fig1, fig2, ncol=2, nrow=1)

### Out-of-Bag Error Plots for the Two Models
plot(fit.RF2, main="OOB Estimate Errors For Naive and Proposed RF Models",
     col="blue", type="l", lwd=1.8)
plot(fit.RF1, add=TRUE, col="green4", type="l", lwd=1.8)
legend("topright", legend=c("Naive Model", "Proposed Model"),
       col=c("green4", "blue"), lty=1, cex=1.2, lwd=2)
Curriculum Vitae
Roland Fiagbe was born on September 15, 1993, the first child of the late Mr. John Kwaku
Fiagbe and Janet Karikari. He is a past student of Osei Kyeretwie Senior High School in
Kumasi, Ghana, from which he graduated in 2013. In 2014, he entered Kwame Nkrumah
University of Science and Technology where he obtained his Bachelor’s degree in Statistics
with first class honors in 2018. During Roland’s time in school he served as a financial
committee member for the Association of Mathematics and Statistics Students (AMSS) in
2015-2016, treasurer for the Association of Mathematics and Statistics Students (AMSS)
in 2016-2017, and treasurer for the Ghana National Association of Mathematics Students
(NAMS-GH) in 2018-2019. After completing his Bachelor's degree, he worked at the
Electoral Commission of Ghana, Ashanti regional office in Kumasi, from 2018 to 2019. At
the Commission, Roland was actively involved in data analyses and administrative duties.
On the strength of his outstanding academic performance, and desire to pursue higher
education, he entered into a graduate program at The University of Texas at El Paso in
Fall 2019 to pursue a master’s degree in Statistics. While pursuing his master’s degree, he
worked as a Graduate Teaching Assistant (GTA) at the Mathematical Sciences Depart-
ment where he assisted professors and worked at the Math Resource Center for Students
(MaRCS) of the university to help students from all disciplines and programs.
Roland worked under the supervision of Professor Xiaogang Su on a thesis titled "High-
Dimensional Random Forests". After obtaining his master's degree in Spring 2021, Roland
will pursue his doctoral degree in Big Data Analytics at the University of Central Florida
in Orlando, Florida, starting Fall 2021.
Email address: [email protected]